SparkRepartitionDef

case class SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None) extends SmartDataLakeLogger with Product with Serializable

This controls repartitioning of the DataFrame before it is written with Spark to Hadoop.

When writing multiple partitions of a partitioned DataObject, the number of Spark tasks created is equal to numberOfTasksPerPartition multiplied by the number of partitions to write. To spread the records of a partition over only numberOfTasksPerPartition Spark tasks, keyCols must be given; they are used to derive a task number inside the partition (hashvalue(keyCols) modulo numberOfTasksPerPartition).

When writing to an unpartitioned DataObject, or to only one partition of a partitioned DataObject, the number of Spark tasks created is equal to numberOfTasksPerPartition. Optional keyCols can be used to keep corresponding records together in the same task/file.
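The task arithmetic above can be sketched in plain Scala. This is an illustration, not the library's code; the names totalTasks and taskIndex are hypothetical helpers:

```scala
// Total Spark tasks when writing several Hadoop partitions at once:
// numberOfTasksPerPartition multiplied by the number of partitions to write.
def totalTasks(numberOfTasksPerPartition: Int, numPartitionsToWrite: Int): Int =
  numberOfTasksPerPartition * numPartitionsToWrite

// Task number of a record inside its Hadoop partition, derived from the
// values of keyCols: hashvalue(keyCols) modulo numberOfTasksPerPartition,
// kept non-negative with floorMod.
def taskIndex(keyColValues: Seq[Any], numberOfTasksPerPartition: Int): Int =
  math.floorMod(keyColValues.hashCode, numberOfTasksPerPartition)
```

For example, numberOfTasksPerPartition = 2 with 3 partitions to write yields 6 tasks, and records with equal keyCols values always map to the same task number.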

numberOfTasksPerPartition

Number of Spark tasks to create per partition before writing to DataObject by repartitioning the DataFrame. This controls how many files are created in each Hadoop partition.

keyCols

Optional key columns to distribute records over Spark tasks inside a Hadoop partition. If the DataObject has Hadoop partitions defined, keyCols must be defined.

sortCols

Optional columns to sort records inside files created.

filename

Optional filename used to rename the target file(s). If numberOfTasksPerPartition is greater than 1, multiple files can exist in a directory, and a task number is inserted into the filename after the first '.'. Example: filename=data.csv -> files created are data.1.csv, data.2.csv, ...
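The numbering scheme can be sketched as follows. This is an illustration of the behavior described above, not the library's implementation; numberedFilename is a hypothetical helper:

```scala
// Insert a task number into a filename after the first '.',
// e.g. "data.csv" with task 1 becomes "data.1.csv".
def numberedFilename(filename: String, taskNr: Int): String = {
  val idx = filename.indexOf('.')
  if (idx < 0) s"$filename.$taskNr" // no extension: just append the number
  else filename.substring(0, idx) + s".$taskNr" + filename.substring(idx)
}
```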

Annotations
@Scaladoc()
Linear Supertypes
Serializable, Serializable, Product, Equals, SmartDataLakeLogger, AnyRef, Any

Instance Constructors

  1. new SparkRepartitionDef(numberOfTasksPerPartition: Int, keyCols: Seq[String] = Seq(), sortCols: Seq[String] = Seq(), filename: Option[String] = None)


Value Members

  1. val filename: Option[String]
  2. val keyCols: Seq[String]
  3. lazy val logger: Logger
    Attributes
    protected
    Definition Classes
    SmartDataLakeLogger
    Annotations
    @transient()
  4. val numberOfTasksPerPartition: Int
  5. def prepareDataFrame(df: DataFrame, partitions: Seq[String], partitionValues: Seq[PartitionValues], dataObjectId: DataObjectId): DataFrame

    df

    DataFrame to repartition

    partitions

    partition columns of the DataObject

    partitionValues

    PartitionValues to be written with this DataFrame

    dataObjectId

    id of the DataObject, used for logging

    Annotations
    @Scaladoc()
  6. def renameFiles(fileRefs: Seq[FileRef])(implicit filesystem: FileSystem): Unit
  7. val sortCols: Seq[String]
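The effect of prepareDataFrame can be simulated over plain Scala collections. The real method operates on a Spark DataFrame (roughly: repartition on a task number derived from keyCols, then sort within partitions by sortCols); the sketch below only illustrates that grouping and sorting, with Row as a hypothetical stand-in type:

```scala
// A record is modeled as a column-name -> value map for illustration.
type Row = Map[String, String]

// Group rows into numberOfTasksPerPartition tasks via hash(keyCols) modulo n,
// then sort the rows inside each task/file by sortCols.
def simulatePrepare(rows: Seq[Row], keyCols: Seq[String], sortCols: Seq[String],
                    numberOfTasksPerPartition: Int): Map[Int, Seq[Row]] =
  rows
    .groupBy(r => math.floorMod(keyCols.map(r).hashCode, numberOfTasksPerPartition))
    .view.mapValues(_.sortBy(r => sortCols.map(r).mkString("|")))
    .toMap
```

Records sharing the same keyCols values always land in the same group, which is what keeps corresponding records together in the same task/file.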
