object BatchWithPartitionDataUtils
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
addPartitionValuesToBatch(batch: ColumnarBatch, partitionRows: Array[Long], partitionValues: Array[InternalRow], partitionSchema: StructType, maxGpuColumnSizeBytes: Long): GpuColumnarBatchIterator
Splits partition data (values and row counts) into smaller batches, ensuring that the size of each column stays below the maximum column size. It then uses these smaller partition batches to split the input batch and merges them to produce an Iterator of split ColumnarBatches.
Using an Iterator defers the actual merging until a batch is requested, which avoids holding GPU memory unnecessarily.
- batch
Input batch; it is closed after the call returns
- partitionRows
Row counts collected from the batch; must have the same length as "partitionValues"
- partitionValues
Partition values collected from the batch
- partitionSchema
Partition schema
- maxGpuColumnSizeBytes
Maximum number of bytes for a GPU column
- returns
a new columnar batch iterator with partition values
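The deferred merging described above can be illustrated with plain Scala collections. This is only a sketch of the laziness idea, not the plugin's implementation: `LazyMergeSketch`, `merge`, `mergedBatches`, and the `merges` counter are hypothetical names, and a string stands in for a merged ColumnarBatch.

```scala
object LazyMergeSketch {
  // Counts how many merges have actually run (for illustration only).
  var merges = 0

  // Hypothetical stand-in for the expensive step that merges a split batch
  // with its partition value columns.
  def merge(rows: Int): String = {
    merges += 1
    s"batch($rows rows)"
  }

  // Iterator.map is lazy: merge runs only when next() is called on the result.
  def mergedBatches(splitRowCounts: Seq[Int]): Iterator[String] =
    splitRowCounts.iterator.map(merge)
}
```

Creating the iterator performs no merge; each call to `next()` performs exactly one, so only the batch currently being consumed needs to be materialized.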
-
def
addSinglePartitionValueToBatch(batch: ColumnarBatch, partitionValues: InternalRow, partitionSchema: StructType, maxGpuColumnSizeBytes: Long): GpuColumnarBatchIterator
Adds a single set of partition values to all rows in a ColumnarBatch, ensuring that the size of each column stays below the maximum column size.
- returns
a new columnar batch iterator with partition values
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
splitBatchInHalf: (BatchWithPartitionData) ⇒ Seq[BatchWithPartitionData]
Splits a BatchWithPartitionData into two halves, each containing roughly half the data. This function is used by the retry framework.
-
def
splitPartitionDataInHalf(partitionedRowsData: Array[PartitionRowData]): Array[Array[PartitionRowData]]
Splits an array of PartitionRowData into two halves of equal size.
Example 1,
Input:
  partitionedRowsData: [ (1000, [ab, cd]), (2000, [def, add]) ]
  target rows: 1500
Result:
  [ [ (1000, [ab, cd]), (500, [def, add]) ], [ (1500, [def, add]) ] ]
Example 2,
Input:
  partitionedRowsData: [ (1000, [ab, cd]) ]
  target rows: 500
Result:
  [ [ (500, [ab, cd]) ], [ (500, [ab, cd]) ] ]
- Note
Splitting remains possible even when the input is a single large partition; such a partition can be halved repeatedly until only one row is left.
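The two examples above can be reproduced with a small sketch. It assumes the target is half of the total row count (rounded down, so any odd row goes to the second half) and uses a hypothetical `PartRows` case class in place of PartitionRowData, with strings as partition values; `SplitHalfSketch` and `splitInHalf` are illustrative names, not part of this API.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitHalfSketch {
  // Hypothetical stand-in for PartitionRowData: a row count plus its partition values.
  final case class PartRows(rowNum: Long, values: Seq[String])

  def splitInHalf(parts: Seq[PartRows]): (List[PartRows], List[PartRows]) = {
    // Rows wanted in the first half (assumption: total / 2, rounded down).
    var remaining = parts.map(_.rowNum).sum / 2
    val first = ArrayBuffer.empty[PartRows]
    val second = ArrayBuffer.empty[PartRows]
    for (p <- parts) {
      // Take as many rows as the first half still needs; the rest go to the second half.
      val take = math.min(remaining, p.rowNum)
      if (take > 0) first += p.copy(rowNum = take)
      if (p.rowNum - take > 0) second += p.copy(rowNum = p.rowNum - take)
      remaining -= take
    }
    (first.toList, second.toList)
  }
}
```

A partition that straddles the midpoint appears in both halves with the same partition values and complementary row counts, which is why a single large partition can still be split.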
-
def
splitPartitionDataIntoGroups(partitionRowData: Array[PartitionRowData], partSchema: StructType, maxGpuColumnSizeBytes: Long): Array[Array[PartitionRowData]]
Splits partitions into smaller batches, ensuring that the batch size for each column does not exceed the maximum column size limit.
Data structures:
- sizeOfBatch: Array that stores the size of batches for each column.
- currentBatch: Array used to hold the rows and partition values of the current batch.
- resultBatches: Array that stores the resulting batches after splitting.
Algorithm:
- Initialize sizeOfBatch and resultBatches.
- Iterate through partRows:
  - Get rowsInPartition. This can either be rows from a new partition or rows remaining to be processed if there was a split.
  - Calculate the maximum number of rows we can fit in the current batch without exceeding the limit.
  - If max rows that fit < rows in partition, we need to split:
    - Append entry (InternalRow, max rows that fit) to the current batch.
    - Append current batch to the result batches.
    - Reset variables.
  - If there was no split:
    - Append entry (InternalRow, rowsInPartition) to the current batch.
    - Update sizeOfBatch with the size of the partition for each column.
    - This implies all remaining rows can be added to the current batch without exceeding the limit.
Example:
Input:
  partition rows: [10, 40, 70, 10, 11]
  partition values: [ [abc, ab], [bc, ab], [abc, bc], [aa, cc], [ade, fd] ]
  limit: 300 bytes
Result:
  [ [ (10, [abc, ab]), (40, [bc, ab]), (63, [abc, bc]) ], [ (7, [abc, bc]), (10, [aa, cc]), (11, [ade, fd]) ] ]
- returns
An array of batches, each containing (row count, partition values) pairs, such that no column in a batch exceeds the column size limit.
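The algorithm above can be sketched in plain Scala. This is a simplified illustration under stated assumptions, not the actual implementation: `PartRows` is a hypothetical stand-in for PartitionRowData with string values, `numCols` replaces the partition schema, and a value is assumed to cost its UTF-8 byte length per row (offset and validity buffers are ignored). Taking at least one row into an empty batch is a choice made here so the loop always progresses.

```scala
import scala.collection.mutable.ArrayBuffer

object GroupSketch {
  // Hypothetical stand-in for PartitionRowData.
  final case class PartRows(rowNum: Long, values: Seq[String])

  // Assumption: per-row cost of a value is its UTF-8 byte length.
  private def bytesPerRow(v: String): Long = v.getBytes("UTF-8").length.toLong

  def splitIntoGroups(parts: Seq[PartRows], numCols: Int, maxColBytes: Long): List[List[PartRows]] = {
    require(numCols > 0 && parts.forall(_.values.length == numCols))
    val resultBatches = ArrayBuffer.empty[List[PartRows]]
    val currentBatch = ArrayBuffer.empty[PartRows]
    val sizeOfBatch = Array.fill(numCols)(0L) // bytes used so far, per column

    for (p <- parts) {
      val rowBytes = p.values.map(bytesPerRow)
      var rowsLeft = p.rowNum
      while (rowsLeft > 0) {
        // Largest row count that keeps every column within the limit.
        val fits = rowBytes.indices.map { i =>
          if (rowBytes(i) == 0L) rowsLeft
          else (maxColBytes - sizeOfBatch(i)) / rowBytes(i)
        }.min
        val capped = math.min(fits, rowsLeft)
        // An empty batch always accepts at least one row (sketch-only choice).
        val take = if (currentBatch.isEmpty) math.max(1L, capped) else capped
        if (take > 0) {
          currentBatch += PartRows(take, p.values)
          for (i <- rowBytes.indices) sizeOfBatch(i) += take * rowBytes(i)
          rowsLeft -= take
        }
        if (rowsLeft > 0) {
          // Split: close the current batch and reset the per-column sizes.
          resultBatches += currentBatch.toList
          currentBatch.clear()
          java.util.Arrays.fill(sizeOfBatch, 0L)
        }
      }
    }
    if (currentBatch.nonEmpty) resultBatches += currentBatch.toList
    resultBatches.toList
  }
}
```

With the input from the example (limit 300 bytes), the first column holds 30 + 80 = 110 bytes after the first two partitions, so only (300 - 110) / 3 = 63 rows of the third partition fit; the remaining 7 rows start the second batch.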
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()