class MultiFileOrcPartitionReader extends MultiFileCoalescingPartitionReaderBase with OrcCommonFunctions
- By Inheritance
- MultiFileOrcPartitionReader
- OrcCommonFunctions
- OrcCodecWritingHelper
- MultiFileCoalescingPartitionReaderBase
- MultiFileReaderFunctions
- FilePartitionReaderBase
- ScanWithMetrics
- Logging
- PartitionReader
- Closeable
- AutoCloseable
- AnyRef
- Any
Instance Constructors
-
new
MultiFileOrcPartitionReader(conf: Configuration, files: Array[PartitionedFile], clippedStripes: Seq[OrcSingleStripeMeta], readDataSchema: StructType, debugDumpPrefix: Option[String], debugDumpAlways: Boolean, maxReadBatchSizeRows: Integer, maxReadBatchSizeBytes: Long, targetBatchSizeBytes: Long, maxGpuColumnSizeBytes: Long, useChunkedReader: Boolean, maxChunkedReaderMemoryUsageSizeBytes: Long, execMetrics: Map[String, GpuMetric], partitionSchema: StructType, numThreads: Int, isCaseSensitive: Boolean)
- conf
Configuration
- files
files to be read
- clippedStripes
the stripe metadata from the original ORC file, clipped to contain only the column chunks to be read
- readDataSchema
the Spark schema describing what will be read
- debugDumpPrefix
a path prefix to use for dumping the fabricated ORC data, or None to disable dumping
- debugDumpAlways
whether to always debug dump or only on errors
- maxReadBatchSizeRows
soft limit on the maximum number of rows the reader reads per batch
- maxReadBatchSizeBytes
soft limit on the maximum number of bytes the reader reads per batch
- targetBatchSizeBytes
the target size of a batch
- maxGpuColumnSizeBytes
the maximum size of a GPU column
- useChunkedReader
whether to read the ORC data in chunks or all at once
- maxChunkedReaderMemoryUsageSizeBytes
soft limit on the number of bytes of internal memory usage that the reader will use
- execMetrics
metrics
- partitionSchema
schema of partitions
- numThreads
the size of the threadpool
- isCaseSensitive
whether the name check should be case sensitive or not
Type Members
- class OrcCopyStripesRunner extends Callable[(Seq[DataBlockBase], Long)]
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
var
batchIter: Iterator[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
def
buildReaderSchema(updatedSchema: TypeDescription, requestedMapping: Option[Array[Int]]): TypeDescription
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
buildReaderSchema(ctx: OrcPartitionReaderContext): TypeDescription
Get the ORC schema corresponding to the file being constructed for the GPU
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
calculateEstimatedBlocksOutputSize(batchContext: BatchContext): Long
Calculate the output size from the block chunks and the schema; the estimate is used as the initial allocation size of the HostMemoryBuffer.
Please note that the estimated size must be at least the size of HEADER + Blocks + FOOTER.
- batchContext
the batch building context
- returns
Long, the estimated output size
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
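The contract can be illustrated with a small, self-contained sketch. StripeMeta and the tail-padding constants below are hypothetical, not the real spark-rapids types; the point is only that the estimate deliberately over-shoots so the HostMemoryBuffer allocated from it never needs to grow:

```scala
// Hypothetical sketch of the size-estimation contract; StripeMeta and the
// tail padding are illustrative, not the real spark-rapids implementation.
object EstimateSketch {
  val OrcHeaderLen = 3L // the "ORC" magic at the start of the file

  case class StripeMeta(dataLen: Long, indexLen: Long, footerLen: Long)

  // Over-estimate: HEADER + all stripe bytes + a padded guess for the tail.
  // An over-estimate is safe; an under-estimate would force the
  // HostMemoryBuffer to be re-allocated later.
  def estimatedOutputSize(stripes: Seq[StripeMeta]): Long = {
    val blocksSize = stripes.iterator.map(s => s.dataLen + s.indexLen + s.footerLen).sum
    val estimatedTail = 1024L + 128L * stripes.size
    OrcHeaderLen + blocksSize + estimatedTail
  }
}
```

Because the estimate already covers HEADER + Blocks + FOOTER, the final-size calculation can simply reuse it without triggering a buffer re-allocation.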
-
final
def
calculateFileTailSize(ctx: OrcPartitionReaderContext, footerStartOffset: Long, stripes: Seq[OrcOutputStripe]): Long
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
calculateFinalBlocksOutputSize(footerOffset: Long, stripes: Seq[DataBlockBase], batchContext: BatchContext): Long
Calculate the final block output size, which decides whether the HostMemoryBuffer must be re-allocated.
At this point the ORC file footer size is still unknown, so the exact final size cannot be computed. Since calculateEstimatedBlocksOutputSize has over-estimated the size, it is safe to reuse that estimate, and it will not cause a HostMemoryBuffer re-allocation.
- footerOffset
footer offset
- stripes
stripes to be evaluated
- batchContext
the batch building context
- returns
the output size
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
-
def
checkIfNeedToSplitDataBlock(currentBlockInfo: SingleDataBlockInfo, nextBlockInfo: SingleDataBlockInfo): Boolean
Check whether the next block must be split into another ColumnarBatch.
- currentBlockInfo
current SingleDataBlockInfo
- nextBlockInfo
next SingleDataBlockInfo
- returns
true if the next block should start a new ColumnarBatch, false if it can be coalesced into the current one
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
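A minimal sketch of such a split predicate (the field names are invented for illustration; the real check compares OrcBlockMetaForSplitCheck instances):

```scala
// Illustrative split check: two consecutive blocks can only be coalesced
// into one ColumnarBatch when their file-level metadata is compatible.
// BlockMetaSketch is a stand-in for the real per-block metadata.
case class BlockMetaSketch(schemaFieldNames: Seq[String], compressionKind: String)

// true => the next block must start a new ColumnarBatch
def needNewBatch(cur: BlockMetaSketch, next: BlockMetaSketch): Boolean =
  cur.schemaFieldNames != next.schemaFieldNames ||
    cur.compressionKind != next.compressionKind
```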
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
close(): Unit
- Definition Classes
- FilePartitionReaderBase → Closeable → AutoCloseable
-
val
conf: Configuration
- Definition Classes
- MultiFileOrcPartitionReader → OrcCommonFunctions
-
def
copyStripeData(dataReader: GpuOrcDataReader, out: HostMemoryOutputStream, inputDataRanges: DiskRangeList): Unit
Copy the stripe data to the output stream
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
createBatchContext(chunkedBlocks: LinkedHashMap[Path, ArrayBuffer[DataBlockBase]], clippedSchema: SchemaBase): BatchContext
Return a batch context that is shared throughout the process of building a memory file, i.e. across the following APIs:
- calculateEstimatedBlocksOutputSize
- writeFileHeader
- getBatchRunner
- calculateFinalBlocksOutputSize
- writeFileFooter
This is useful when some state is needed by some or all of the above APIs. Subclasses can override this to return a customized batch context.
- chunkedBlocks
mapping of file path to data blocks
- clippedSchema
schema info
- Attributes
- protected
- Definition Classes
- MultiFileCoalescingPartitionReaderBase
-
def
currentMetricsValues(): Array[CustomTaskMetric]
- Definition Classes
- PartitionReader
-
val
debugDumpAlways: Boolean
Whether to always debug dump or only on errors
- Definition Classes
- MultiFileOrcPartitionReader → OrcCommonFunctions
-
val
debugDumpPrefix: Option[String]
Whether debug dumping is enabled and the path prefix where to dump
- Definition Classes
- MultiFileOrcPartitionReader → OrcCommonFunctions
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
estimateOutputSizeFromBlocks(blocks: Seq[OrcStripeWithMeta]): Long
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
fileSystemBytesRead(): Long
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
- Annotations
- @nowarn()
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
finalizeOutputBatch(batch: ColumnarBatch, extraInfo: ExtraInfo): ColumnarBatch
A callback to finalize the output batch. The batch returned will be the final output batch of the reader's "get" method.
- batch
the batch after decoding, adding partitioned columns.
- extraInfo
the corresponding extra information of the input batch.
- returns
the finalized columnar batch.
- Attributes
- protected
- Definition Classes
- MultiFileCoalescingPartitionReaderBase
-
def
get(): ColumnarBatch
- Definition Classes
- FilePartitionReaderBase → PartitionReader
-
def
getBatchRunner(tc: TaskContext, file: Path, outhmb: HostMemoryBuffer, blocks: ArrayBuffer[DataBlockBase], offset: Long, batchContext: BatchContext): Callable[(Seq[DataBlockBase], Long)]
The sub-class must implement the real file-reading logic in a Callable that will run in a thread pool.
- tc
task context to use
- file
file to be read
- outhmb
the sliced HostMemoryBuffer to hold the blocks; the sub-class implementation is responsible for closing it
- blocks
block metadata specifying which blocks to read
- offset
used as the offset adjustment
- batchContext
the batch building context
- returns
a Callable[(Seq[DataBlockBase], Long)] to be submitted to a ThreadPoolExecutor; the Callable returns a tuple where result._1 is the block metadata with adjusted offsets and result._2 is the number of bytes read
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
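The Callable contract can be sketched with plain JDK types. DataBlockSketch is hypothetical, and the real runner also copies the stripe bytes into its slice of host memory; only the return-value shape is what matters here:

```scala
import java.util.concurrent.Callable

// Stand-in for the real block metadata.
case class DataBlockSketch(offset: Long, length: Long)

// Returns (block metas with offsets adjusted to the destination, bytes read),
// matching the (Seq[DataBlockBase], Long) tuple described above.
class CopyBlocksSketch(blocks: Seq[DataBlockSketch], destOffset: Long)
    extends Callable[(Seq[DataBlockSketch], Long)] {
  override def call(): (Seq[DataBlockSketch], Long) = {
    var cursor = destOffset
    val adjusted = blocks.map { b =>
      val moved = b.copy(offset = cursor) // block now addressed in the output buffer
      cursor += b.length
      moved
    }
    (adjusted, blocks.iterator.map(_.length).sum) // result._2 is the bytes read
  }
}
```

Such a runner would then be submitted to a fixed-size pool, e.g. `Executors.newFixedThreadPool(numThreads).submit(new CopyBlocksSketch(...))`.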
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getFileFormatShortName: String
File format short name used for logging and other things to uniquely identify which file format is being used.
- returns
the file format short name
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
-
def
getORCOptionsAndSchema(memFileSchema: TypeDescription, requestedMapping: Option[Array[Int]], readDataSchema: StructType): (ORCOptions, TypeDescription)
- Definition Classes
- OrcCommonFunctions
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
var
isDone: Boolean
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isNeedToSplitDataBlock(curMeta: OrcBlockMetaForSplitCheck, nextMeta: OrcBlockMetaForSplitCheck): Boolean
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
metrics: Map[String, GpuMetric]
- Definition Classes
- ScanWithMetrics
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
next(): Boolean
- Definition Classes
- MultiFileCoalescingPartitionReaderBase → PartitionReader
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
readBufferToTablesAndClose(dataBuffer: HostMemoryBuffer, dataSize: Long, clippedSchema: SchemaBase, readSchema: StructType, extraInfo: ExtraInfo): GpuDataProducer[Table]
Send the host memory to the GPU to decode
- dataBuffer
the data which can be decoded in GPU
- dataSize
data size
- clippedSchema
the clipped schema
- readSchema
the expected schema
- extraInfo
the extra information for specific file format
- returns
a GpuDataProducer of decoded Tables
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
-
val
readDataSchema: StructType
- Definition Classes
- MultiFileOrcPartitionReader → OrcCommonFunctions
-
def
startNewBufferRetry: Unit
You can reset the target batch size if needed for splits...
- Definition Classes
- MultiFileCoalescingPartitionReaderBase
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
implicit
def
toDataStripes(stripes: Seq[DataBlockBase]): Seq[OrcStripeWithMeta]
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
- implicit def toOrcExtraInfo(in: ExtraInfo): OrcExtraInfo
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
implicit
def
toStripe(block: DataBlockBase): OrcStripeWithMeta
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
- implicit def toTypeDescription(schema: SchemaBase): TypeDescription
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
withCodecOutputStream[T](ctx: OrcPartitionReaderContext, out: OutputStream)(block: (OrcProtoWriterShim) ⇒ T): T
Executes the provided code block in the codec environment
- Definition Classes
- OrcCodecWritingHelper
-
def
writeFileFooter(buffer: HostMemoryBuffer, bufferSize: Long, footerOffset: Long, stripes: Seq[DataBlockBase], batchContext: BatchContext): (HostMemoryBuffer, Long)
Write a footer for a specific file format. If the file format has no footer, just return (hmb, offset).
Please note that some file formats may re-allocate the HostMemoryBuffer, because the estimated initial buffer size may be slightly smaller than the actual size. In that case, the implementation must close the original hmb.
- buffer
The buffer holding (header + data blocks)
- bufferSize
The total buffer size, equal to the size of (header + blocks + footer)
- footerOffset
Where to begin writing the footer
- stripes
The data block meta info
- batchContext
The batch building context
- returns
the buffer and the buffer size
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
-
def
writeFileHeader(buffer: HostMemoryBuffer, batchContext: BatchContext): Long
Write a header for a specific file format. If the file format has no header, just ignore it and return 0.
- buffer
where the header will be written
- batchContext
the batch building context
- returns
the number of bytes written
- Definition Classes
- MultiFileOrcPartitionReader → MultiFileCoalescingPartitionReaderBase
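For ORC the header is just the 3-byte "ORC" magic; a hedged sketch of the contract (the function name is illustrative) shows why the byte count is returned, so the caller knows where block data begins:

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

// Sketch of the header contract for ORC: emit the "ORC" magic and report
// how many bytes were written; a format with no header would return 0.
def writeOrcHeaderSketch(out: ByteArrayOutputStream): Long = {
  val magic = "ORC".getBytes(StandardCharsets.UTF_8)
  out.write(magic)
  magic.length.toLong
}
```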
-
final
def
writeOrcFileHeader(outStream: HostMemoryOutputStream): Long
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions
-
final
def
writeOrcFileTail(outStream: HostMemoryOutputStream, ctx: OrcPartitionReaderContext, footerStartOffset: Long, stripes: Seq[OrcOutputStripe]): Unit
Write the ORC file footer and PostScript
- Attributes
- protected
- Definition Classes
- OrcCommonFunctions