package hdfs
Type Members
- case class PartitionValues(elements: Map[String, Any]) extends Product with Serializable
A partition is defined by values for its partition columns. It can be represented by a Map whose keys are the partition column names.
- Annotations
- @Scaladoc() @DeveloperApi()
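A minimal sketch of the idea behind PartitionValues: a partition represented as a Map from partition column name to value. The helper `toPartitionPath` is hypothetical (not part of the actual API) and only illustrates how such a Map corresponds to a Hive-style partition path.

```scala
// Sketch only: a partition as a Map of partition column name -> value.
case class PartitionValues(elements: Map[String, Any]) {
  // Hypothetical helper: render a Hive-style partition path such as
  // "dt=20240101/region=EU" using the given partition column order.
  def toPartitionPath(colOrder: Seq[String]): String =
    colOrder.map(col => s"$col=${elements(col)}").mkString("/")
}

val pv = PartitionValues(Map("dt" -> "20240101", "region" -> "EU"))
println(pv.toPartitionPath(Seq("dt", "region"))) // prints dt=20240101/region=EU
```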
- case class SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None) extends SmartDataLakeLogger with Product with Serializable
This controls repartitioning of the DataFrame before writing with Spark to Hadoop.
When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created is equal to numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of a partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).
When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created is equal to numberOfTasksPerPartition. Optionally, keyCols can be used to keep corresponding records together in the same task/file.
- numberOfTasksPerPartition
Number of Spark tasks to create per partition before writing to DataObject by repartitioning the DataFrame. This controls how many files are created in each Hadoop partition.
- keyCols
Optional key columns to distribute records over Spark tasks inside a Hadoop partition. If DataObject has Hadoop partitions defined, keyCols must be defined.
- sortCols
Optional columns used to sort records within the files created.
- filename
Optional filename to rename the target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a number is inserted into the filename after the first '.'. Example: filename=data.csv -> files created are data.1.csv, data.2.csv, ...
- Annotations
- @Scaladoc()
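A sketch of two behaviours described above, written as plain Scala. Both functions are assumptions for illustration, not the actual SparkRepartitionDef implementation: one derives the task number inside a partition from the key column values, the other inserts the task number into the target filename after the first '.'.

```scala
// Sketch: derive a task number from key column values, mirroring
// hashvalue(keyCols) modulo numberOfTasksPerPartition as described above.
def taskNumber(keyValues: Seq[Any], numberOfTasksPerPartition: Int): Int =
  math.floorMod(keyValues.hashCode, numberOfTasksPerPartition)

// Sketch: insert the task number after the first '.' of the filename,
// e.g. data.csv -> data.1.csv.
def numberedFilename(filename: String, taskNr: Int): String = {
  val (base, rest) = filename.span(_ != '.')
  s"$base.$taskNr$rest"
}

// Records with equal key column values always map to the same task/file.
val t1 = taskNumber(Seq("customerA"), 4)
val t2 = taskNumber(Seq("customerA"), 4)
println(t1 == t2)                        // prints true
println(numberedFilename("data.csv", 1)) // prints data.1.csv
```

Because the task number is a deterministic hash modulo, all records sharing the same keyCols values land in the same file, which is exactly the "keep corresponding records together" behaviour described above.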
Value Members
- object SparkRepartitionDef extends SmartDataLakeLogger with Serializable