com.nvidia.spark.rapids

object BatchWithPartitionDataUtils

Linear Supertypes

AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def addPartitionValuesToBatch(batch: ColumnarBatch, partitionRows: Array[Long], partitionValues: Array[InternalRow], partitionSchema: StructType, maxGpuColumnSizeBytes: Long): GpuColumnarBatchIterator

    Splits partition data (values and row counts) into smaller batches, ensuring that the size of each column stays below the maximum column size. It then uses these smaller partitioned batches to split the input batch, merging them to produce an Iterator of split ColumnarBatches.

    Using an Iterator ensures that the actual merging does not happen until a batch is required, avoiding wasted GPU memory.

    batch

    Input batch, will be closed after the call returns

    partitionRows

    Row counts collected from the batch; it must have the same length as "partitionValues"

    partitionValues

    Partition values collected from the batch

    partitionSchema

    Partition schema

    maxGpuColumnSizeBytes

    Maximum number of bytes for a GPU column

    returns

    a new columnar batch iterator with partition values
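    The merge step can be modeled on the CPU: each set of partition values is repeated once per row of its partition to form the extra partition columns of the output batch. A minimal Python sketch of that idea (the helper name and list-based column representation are illustrative, not part of the plugin):

```python
def expand_partition_values(partition_rows, partition_values):
    """Repeat each partition's values once per row, one output list per column.

    partition_rows:   row count per partition, e.g. [2, 1]
    partition_values: one tuple of partition-column values per partition
    """
    columns = []
    for column_values in zip(*partition_values):  # iterate column by column
        column = []
        for rows, value in zip(partition_rows, column_values):
            column.extend([value] * rows)  # one copy of the value per row
        columns.append(column)
    return columns

# 2 rows of partition ("a", "x") followed by 1 row of ("b", "y"):
# expand_partition_values([2, 1], [("a", "x"), ("b", "y")])
#   -> [["a", "a", "b"], ["x", "x", "y"]]
```

    Because every row carries a full copy of its partition values, wide or long partitions can inflate column sizes, which is why the real method splits the batch before merging.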

  5. def addSinglePartitionValueToBatch(batch: ColumnarBatch, partitionValues: InternalRow, partitionSchema: StructType, maxGpuColumnSizeBytes: Long): GpuColumnarBatchIterator

    Adds a single set of partition values to all rows in a ColumnarBatch, ensuring that the size of each column is less than the maximum column size.

    returns

    a new columnar batch iterator with partition values

    See also

    com.nvidia.spark.rapids.BatchWithPartitionDataUtils.addPartitionValuesToBatch

  6. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  8. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. def splitBatchInHalf: (BatchWithPartitionData) ⇒ Seq[BatchWithPartitionData]

    Splits a BatchWithPartitionData into two halves, each containing roughly half the data. This function is used by the retry framework.

  18. def splitPartitionDataInHalf(partitionedRowsData: Array[PartitionRowData]): Array[Array[PartitionRowData]]

    Splits an array of PartitionRowData into two halves of roughly equal row count.

    Example 1:

    Input:
      partitionedRowsData: [ (1000, [ab, cd]), (2000, [def, add]) ]
      target rows: 1500
    
    Result:
      [
         [ (1000, [ab, cd]), (500, [def, add]) ],
         [ (1500, [def, add]) ]
      ]

    Example 2:

    Input:
      partitionedRowsData: [ (1000, [ab, cd]) ]
      target rows: 500
    
    Result:
      [
         [ (500, [ab, cd]) ],
         [ (500, [ab, cd]) ]
      ]
    Note

    This function guarantees that splitting is possible even when the data is a single large partition; such a batch can be split repeatedly until only one row remains.
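    The examples above can be reproduced with a small CPU-side model. This Python sketch (a hypothetical helper; the real method operates on PartitionRowData and is driven by the retry framework) represents each partition as a (rowCount, values) pair and splits at half the total row count:

```python
def split_partition_data_in_half(partitioned_rows_data):
    """Split [(row_count, values), ...] into two halves of roughly equal row count."""
    total = sum(rows for rows, _ in partitioned_rows_data)
    if total < 2:
        # Mirrors the documented note: splitting stops once only one row remains.
        raise ValueError("cannot split a single row any further")
    target = total // 2  # row count of the first half
    first, second, consumed = [], [], 0
    for rows, values in partitioned_rows_data:
        if consumed >= target:
            second.append((rows, values))
        elif consumed + rows <= target:
            first.append((rows, values))
            consumed += rows
        else:
            # The partition straddles the boundary: split it across both halves.
            head = target - consumed
            first.append((head, values))
            second.append((rows - head, values))
            consumed = target
    return [first, second]
```

    For Example 1, `split_partition_data_in_half([(1000, ("ab", "cd")), (2000, ("def", "add"))])` yields `[[(1000, ("ab", "cd")), (500, ("def", "add"))], [(1500, ("def", "add"))]]`, matching the result above.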

  19. def splitPartitionDataIntoGroups(partitionRowData: Array[PartitionRowData], partSchema: StructType, maxGpuColumnSizeBytes: Long): Array[Array[PartitionRowData]]

    Splits partitions into smaller batches, ensuring that the batch size of each column does not exceed the maximum column size limit.

    Data structures:

    • sizeOfBatch: Array that stores the size of batches for each column.
    • currentBatch: Array used to hold the rows and partition values of the current batch.
    • resultBatches: Array that stores the resulting batches after splitting.

    Algorithm:

    • Initialize sizeOfBatch and resultBatches.
    • Iterate through the partition rows:
      • Get rowsInPartition - either the rows of a new partition, or the rows remaining to be processed after a split.
      • Calculate the maximum number of rows that fit in the current batch without exceeding the limit.
      • If the rows that fit < rowsInPartition, split:
        • Append the entry (InternalRow, rows that fit) to the current batch.
        • Append the current batch to the result batches.
        • Reset sizeOfBatch and currentBatch.
      • If there was no split, all remaining rows can be added to the current batch without exceeding the limit:
        • Append the entry (InternalRow, rowsInPartition) to the current batch.
        • Update sizeOfBatch with the size of the partition for each column.

    Example:

    Input:
       partition rows:   [10, 40, 70, 10, 11]
       partition values: [ [abc, ab], [bc, ab], [abc, bc], [aa, cc], [ade, fd] ]
       limit:  300 bytes
    
    Result:
    [
     [ (10, [abc, ab]), (40, [bc, ab]), (63, [abc, bc]) ],
     [ (7, [abc, bc]), (10, [aa, cc]), (11, [ade, fd]) ]
    ]
    returns

    An array of batches of (row count, partition values) pairs, such that each batch's size for every column is less than the column size limit.
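    The algorithm and example above can be sketched as a CPU-side Python model. It assumes a column's size is simply rows × the UTF-8 byte length of the partition value (the plugin's real sizing accounts for cuDF column layout), so the helper below is illustrative rather than the actual implementation:

```python
def split_partition_data_into_groups(partition_rows, partition_values,
                                     max_column_size_bytes):
    """Greedily pack (row count, values) entries so every column stays under the limit."""
    num_cols = len(partition_values[0])
    result_batches, current_batch = [], []
    size_of_batch = [0] * num_cols  # bytes accumulated per column

    for rows, values in zip(partition_rows, partition_values):
        per_row = [len(v.encode("utf-8")) for v in values]  # bytes per row, per column
        remaining = rows
        while remaining > 0:
            # Maximum rows of this partition that still fit in every column.
            fit = min((max_column_size_bytes - size_of_batch[c]) // per_row[c]
                      for c in range(num_cols))
            if fit < remaining:
                if fit <= 0 and not current_batch:
                    raise ValueError("limit too small for even one row")
                if fit > 0:
                    current_batch.append((fit, values))
                # Close the current batch and start a new one for the remainder.
                result_batches.append(current_batch)
                current_batch, size_of_batch = [], [0] * num_cols
                remaining -= fit
            else:
                current_batch.append((remaining, values))
                for c in range(num_cols):
                    size_of_batch[c] += remaining * per_row[c]
                remaining = 0
    if current_batch:
        result_batches.append(current_batch)
    return result_batches
```

    With the example input (partition rows [10, 40, 70, 10, 11], the value pairs above, and a 300-byte limit), this model reproduces the documented split: 63 rows of [abc, bc] land in the first batch and the remaining 7 start the second.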

  20. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  21. def toString(): String
    Definition Classes
    AnyRef → Any
  22. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
