case class SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None) extends SmartDataLakeLogger with Product with Serializable
This controls repartitioning of the DataFrame before it is written with Spark to Hadoop.
When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of a partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).
When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created equals numberOfTasksPerPartition. Optional keyCols can be used to keep corresponding records together in the same task/file.
- numberOfTasksPerPartition
Number of Spark tasks to create per partition before writing to the DataObject, by repartitioning the DataFrame. This controls how many files are created in each Hadoop partition.
- keyCols
Optional key columns to distribute records over Spark tasks inside a Hadoop partition. If the DataObject has Hadoop partitions defined, keyCols must be defined.
- sortCols
Optional columns to sort records inside the files created.
- filename
Optional filename to rename target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a number is inserted into the filename after the first '.'. Example: filename=data.csv -> files created are data.1.csv, data.2.csv, ...
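For illustration, a repartitioning definition of this kind is typically given inside a file DataObject's configuration. The sketch below is an assumption-laden example: the surrounding keys (dataObjects, type, path, sparkRepartition) and the DataObject type are assumed for illustration; only the four parameters themselves come from this page.

```hocon
dataObjects {
  my-csv-output {
    type = CsvFileDataObject          # assumed DataObject type
    path = "/data/output"             # assumed path
    sparkRepartition {
      numberOfTasksPerPartition = 2   # two files per Hadoop partition
      keyCols = [customer_id]         # records with the same key land in the same file
      sortCols = [created_at]         # sort records inside each file
      filename = data.csv             # files become data.1.csv, data.2.csv
    }
  }
}
```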
Instance Constructors
- new SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None)
Value Members
- val filename: Option[String]
- val keyCols: Seq[String]
- lazy val logger: Logger (protected; @transient; defined in SmartDataLakeLogger)
- val numberOfTasksPerPartition: Int
- def prepareDataFrame(df: DataFrame, partitions: Seq[String], partitionValues: Seq[PartitionValues], dataObjectId: DataObjectId): DataFrame
- df
DataFrame to repartition
- partitions
the DataObject's partition columns
- partitionValues
PartitionValues to be written with this DataFrame
- dataObjectId
id of the DataObject, used for logging
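The task-number derivation described above (hashvalue(keyCols) modulo numberOfTasksPerPartition) can be sketched in plain Scala. This is illustrative only, not the library's actual implementation; the object and method names are hypothetical.

```scala
// Illustrative sketch: derive a stable task number inside a Hadoop partition
// from the key column values, mimicking hashvalue(keyCols) modulo
// numberOfTasksPerPartition as described above.
object TaskAssignmentSketch {
  def taskNumber(keyColValues: Seq[Any], numberOfTasksPerPartition: Int): Int =
    // floorMod keeps the result non-negative even for negative hash codes
    Math.floorMod(keyColValues.hashCode, numberOfTasksPerPartition)
}
```

Because the task number depends only on the key column values, all records sharing the same keyCols land in the same task, and therefore in the same output file.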
- def renameFiles(fileRefs: Seq[FileRef])(implicit filesystem: FileSystem): Unit
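The renaming rule stated for filename (a running number inserted after the first '.' when numberOfTasksPerPartition is greater than 1) can be sketched as follows. This is a hypothetical helper, not the library's actual renameFiles logic.

```scala
// Hypothetical sketch of the filename numbering rule described above:
// insert the task number into the filename after the first '.'.
object FilenameNumberingSketch {
  def numberedFilename(filename: String, taskNo: Int): String = {
    val dot = filename.indexOf('.')
    if (dot < 0) s"$filename.$taskNo" // no '.' present: append the number (assumption)
    else filename.take(dot) + s".$taskNo" + filename.drop(dot) // data.csv -> data.1.csv
  }
}
```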
- val sortCols: Seq[String]