class MultiFileCloudParquetPartitionReader extends MultiFileCloudPartitionReaderBase with ParquetPartitionReaderBase
A PartitionReader that can read multiple Parquet files in parallel. It is most efficient when running in a cloud environment, where read I/O is slow.
Efficiently reading a Parquet split on the GPU requires reconstructing, in memory, a Parquet file that contains only the column chunks that are needed. This avoids sending unnecessary data to the GPU and saves GPU memory.
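This class implements Spark's PartitionReader[ColumnarBatch], so callers consume it like any other partition reader. A minimal consumption sketch (the drain helper is hypothetical; only the PartitionReader API is Spark's):

```scala
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical helper: drain a PartitionReader[ColumnarBatch], counting
// the rows produced and closing the reader when done.
def drain(reader: PartitionReader[ColumnarBatch]): Long = {
  var rows = 0L
  try {
    while (reader.next()) {          // advances to the next decoded batch
      rows += reader.get().numRows()
    }
  } finally {
    reader.close()                   // release host buffers and other resources
  }
  rows
}
```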
Linear Supertypes
- MultiFileCloudParquetPartitionReader
- ParquetPartitionReaderBase
- MultiFileReaderFunctions
- MultiFileCloudPartitionReaderBase
- FilePartitionReaderBase
- ScanWithMetrics
- Logging
- PartitionReader
- Closeable
- AutoCloseable
- AnyRef
- Any
Instance Constructors
-
new
MultiFileCloudParquetPartitionReader(conf: Configuration, files: Array[PartitionedFile], filterFunc: (PartitionedFile) ⇒ ParquetFileInfoWithBlockMeta, isSchemaCaseSensitive: Boolean, debugDumpPrefix: Option[String], debugDumpAlways: Boolean, maxReadBatchSizeRows: Integer, maxReadBatchSizeBytes: Long, targetBatchSizeBytes: Long, maxGpuColumnSizeBytes: Long, useChunkedReader: Boolean, maxChunkedReaderMemoryUsageSizeBytes: Long, execMetrics: Map[String, GpuMetric], partitionSchema: StructType, numThreads: Int, maxNumFileProcessed: Int, ignoreMissingFiles: Boolean, ignoreCorruptFiles: Boolean, useFieldId: Boolean, alluxioPathReplacementMap: Map[String, String], alluxioReplacementTaskTime: Boolean, queryUsesInputFile: Boolean, keepReadsInOrder: Boolean, combineConf: CombineConf)
- conf
the Hadoop configuration
- files
the partitioned files to read
- filterFunc
a function to filter the necessary blocks from a given file
- isSchemaCaseSensitive
whether schema is case sensitive
- debugDumpPrefix
a path prefix to use for dumping the fabricated Parquet data
- debugDumpAlways
whether to debug dump always or only on errors
- maxReadBatchSizeRows
soft limit on the maximum number of rows the reader reads per batch
- maxReadBatchSizeBytes
soft limit on the maximum number of bytes the reader reads per batch (the soft-limit behavior is sketched after this parameter list)
- targetBatchSizeBytes
the target size of the batch
- maxGpuColumnSizeBytes
the maximum size of a GPU column
- useChunkedReader
whether to read Parquet by chunks or read all at once
- maxChunkedReaderMemoryUsageSizeBytes
soft limit on the number of bytes of internal memory usage that the reader will use
- execMetrics
metrics
- partitionSchema
the schema of partitions
- numThreads
the size of the thread pool
- maxNumFileProcessed
the maximum number of files to read on the CPU side and hold while waiting to be processed on the GPU. This affects the amount of host memory used.
- ignoreMissingFiles
Whether to ignore missing files
- ignoreCorruptFiles
Whether to ignore corrupt files
- useFieldId
Whether to use field id for column matching
- alluxioPathReplacementMap
a map from DFS scheme to Alluxio scheme
- alluxioReplacementTaskTime
Whether the Alluxio replacement algorithm is set to task time
- queryUsesInputFile
Whether the query requires the input file name functionality
- keepReadsInOrder
Whether to require the files to be read in the same order as Spark. Defaults to true for formats that don't explicitly handle this.
- combineConf
configs relevant to combination
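The two maxReadBatchSize parameters above are soft limits: the reader accumulates row-group blocks until adding the next block would exceed either limit, and always takes at least one block so it makes progress. A minimal sketch of that accumulation (BlockInfo and nextChunk are hypothetical stand-ins; the real logic lives in populateCurrentBlockChunk below):

```scala
// Hypothetical stand-in for Parquet row-group metadata.
case class BlockInfo(rows: Long, bytes: Long)

// Accumulate blocks until the next one would push past either soft limit;
// always take at least one block so the reader makes progress.
def nextChunk(blocks: BufferedIterator[BlockInfo],
    maxRows: Int, maxBytes: Long): Seq[BlockInfo] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[BlockInfo]
  var rows = 0L
  var bytes = 0L
  while (blocks.hasNext && (out.isEmpty ||
      (rows + blocks.head.rows <= maxRows &&
        bytes + blocks.head.bytes <= maxBytes))) {
    val b = blocks.next()
    rows += b.rows
    bytes += b.bytes
    out += b
  }
  out.toSeq
}
```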
Type Members
- case class HostMemoryBuffersWithMetaData(partitionedFile: PartitionedFile, origPartitionedFile: Option[PartitionedFile], memBuffersAndSizes: Array[SingleHMBAndMeta], bytesRead: Long, dateRebaseMode: DateTimeRebaseMode, timestampRebaseMode: DateTimeRebaseMode, hasInt96Timestamps: Boolean, clippedSchema: MessageType, readSchema: StructType, allPartValues: Option[Array[(Long, InternalRow)]]) extends HostMemoryBuffersWithMetaDataBase with Product with Serializable
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
PARQUET_META_SIZE: Long
- Definition Classes
- ParquetPartitionReaderBase
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
var
batchIter: Iterator[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
def
calculateExtraMemoryForParquetFooter(numCols: Int, numBlocks: Int): Int
Calculates the amount of extra memory needed when combining multiple files. Extra memory is required because each ColumnChunk saved in the footer has two fields, file_offset and data_page_offset, that can become much larger when files are combined. The estimate is the number of columns times the number of blocks (which gives the number of column chunks), with each of the two fields assumed to grow by at most 8 bytes in the worst case. This probably allocates too much, but not by a huge amount, and it is better than having to reallocate and copy.
- numCols
the number of columns
- numBlocks
the total number of blocks to be combined
- returns
amount of extra memory to allocate
- Definition Classes
- ParquetPartitionReaderBase
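A minimal sketch of the worst-case estimate described above (the helper name is hypothetical; the formula follows directly from the description):

```scala
// Worst case: every column chunk (numCols * numBlocks) has 2 offset fields
// (file_offset, data_page_offset) that may each grow by up to 8 bytes.
def estimatedExtraFooterBytes(numCols: Int, numBlocks: Int): Int =
  numCols * numBlocks * 2 * 8
```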
-
def
calculateParquetFooterSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- Annotations
- @nowarn()
-
def
calculateParquetOutputSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType, handleCoalesceFiles: Boolean): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
canUseCombine: Boolean
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
def
checkIfNeedToSplit(current: HostMemoryBuffersWithMetaData, next: HostMemoryBuffersWithMetaData): Boolean
-
def
checkIfNeedToSplitBlocks(currentDateRebaseMode: DateTimeRebaseMode, nextDateRebaseMode: DateTimeRebaseMode, currentTimestampRebaseMode: DateTimeRebaseMode, nextTimestampRebaseMode: DateTimeRebaseMode, currentSchema: SchemaBase, nextSchema: SchemaBase, currentFilePath: String, nextFilePath: String): Boolean
- Definition Classes
- ParquetPartitionReaderBase
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
close(): Unit
- Definition Classes
- MultiFileCloudPartitionReaderBase → FilePartitionReaderBase → Closeable → AutoCloseable
-
def
combineHMBs(input: Array[HostMemoryBuffersWithMetaDataBase]): HostMemoryBuffersWithMetaDataBase
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
var
combineLeftOverFiles: Option[Array[HostMemoryBuffersWithMetaDataBase]]
- Attributes
- protected
- Definition Classes
- MultiFileCloudPartitionReaderBase
-
def
computeBlockMetaData(blocks: Seq[BlockMetaData], realStartOffset: Long): Seq[BlockMetaData]
Computes new block metadata to reflect where the blocks and columns will appear in the computed Parquet file.
- blocks
block metadata from the original file(s) that will appear in the computed file
- realStartOffset
starting file offset of the first block
- returns
updated block metadata
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- Annotations
- @nowarn()
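A simplified sketch of the offset rewrite this performs (SimpleChunk and relocate are hypothetical stand-ins; the real code operates on org.apache.parquet.hadoop.metadata.BlockMetaData):

```scala
// Hypothetical stand-in for a column chunk's position and size.
case class SimpleChunk(startingPos: Long, totalSize: Long)

// Lay the chunks out contiguously starting at realStartOffset and return
// metadata updated to the new positions in the computed file.
def relocate(chunks: Seq[SimpleChunk], realStartOffset: Long): Seq[SimpleChunk] = {
  var offset = realStartOffset
  chunks.map { c =>
    val moved = c.copy(startingPos = offset)
    offset += c.totalSize
    moved
  }
}
```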
-
val
conf: Configuration
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
copyBlocksData(filePath: Path, out: HostMemoryOutputStream, blocks: Seq[BlockMetaData], realStartOffset: Long, metrics: Map[String, GpuMetric]): Seq[BlockMetaData]
Copies the data corresponding to the clipped blocks in the original file and computes the block metadata for the output. The output blocks will contain the same column chunk metadata but with the file offsets updated to reflect the new position of the column data as written to the output.
- out
the output stream to receive the data
- blocks
block metadata from the original file that will appear in the computed file
- realStartOffset
starting file offset of the first block
- returns
updated block metadata corresponding to the output
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
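The byte copy itself is a plain buffered loop (see copyDataRange below). A minimal sketch of that pattern with hypothetical names (FSDataInputStream extends java.io.DataInputStream, so readFully behaves the same way):

```scala
import java.io.{DataInputStream, OutputStream}

// Copy `length` bytes from `in` to `out` through a reusable buffer.
def copyBytes(in: DataInputStream, out: OutputStream,
    buffer: Array[Byte], length: Long): Unit = {
  var remaining = length
  while (remaining > 0) {
    val toRead = math.min(buffer.length.toLong, remaining).toInt
    in.readFully(buffer, 0, toRead)
    out.write(buffer, 0, toRead)
    remaining -= toRead
  }
}
```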
-
val
copyBufferSize: Int
- Definition Classes
- ParquetPartitionReaderBase
-
def
copyDataRange(range: CopyRange, in: FSDataInputStream, out: HostMemoryOutputStream, copyBuffer: Array[Byte]): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
var
currentFileHostBuffers: Option[HostMemoryBuffersWithMetaDataBase]
- Attributes
- protected
- Definition Classes
- MultiFileCloudPartitionReaderBase
-
def
currentMetricsValues(): Array[CustomTaskMetric]
- Definition Classes
- PartitionReader
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
execMetrics: Map[String, GpuMetric]
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
fileSystemBytesRead(): Long
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
- Annotations
- @nowarn()
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
get(): ColumnarBatch
- Definition Classes
- FilePartitionReaderBase → PartitionReader
-
def
getBatchRunner(tc: TaskContext, file: PartitionedFile, origFile: Option[PartitionedFile], conf: Configuration, filters: Array[Filter]): Callable[HostMemoryBuffersWithMetaDataBase]
File-reading logic in a Callable that will run in a thread pool.
- tc
task context to use
- file
file to be read
- origFile
optional original unmodified file if replaced with Alluxio
- conf
configuration
- filters
push down filters
- returns
a Callable that, when executed, produces the HostMemoryBuffersWithMetaDataBase for the file
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
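A self-contained sketch of the scheduling pattern (generic Callables on a fixed pool; illustrative only, not the plugin's actual thread-pool wiring):

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

object PoolSketch {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4) // cf. the numThreads parameter
    // Each file read becomes a Callable, analogous to getBatchRunner's result.
    val tasks: Seq[Callable[String]] = Seq("a.parquet", "b.parquet").map { f =>
      new Callable[String] { override def call(): String = s"read $f" }
    }
    // Submit all tasks and drain the results as they complete.
    val results = pool.invokeAll(tasks.asJava).asScala.map(_.get())
    results.foreach(println)
    pool.shutdown()
  }
}
```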
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getFileFormatShortName: String
File format short name used for logging and other things to uniquely identify which file format is being used.
- returns
the file format short name
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
def
getParquetOptions(readDataSchema: StructType, clippedSchema: MessageType, useFieldId: Boolean): ParquetOptions
- Definition Classes
- ParquetPartitionReaderBase
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
var
isDone: Boolean
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
val
isSchemaCaseSensitive: Boolean
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
metrics: Map[String, GpuMetric]
- Definition Classes
- ScanWithMetrics
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
next(): Boolean
- Definition Classes
- MultiFileCloudPartitionReaderBase → PartitionReader
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
populateCurrentBlockChunk(blockIter: BufferedIterator[BlockMetaData], maxReadBatchSizeRows: Int, maxReadBatchSizeBytes: Long, readDataSchema: StructType): Seq[BlockMetaData]
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
readBatches(fileBufsAndMeta: HostMemoryBuffersWithMetaDataBase): Iterator[ColumnarBatch]
Decodes the host memory buffers on the GPU.
- fileBufsAndMeta
the host memory buffers and metadata read from a PartitionedFile
- returns
an iterator of the decoded ColumnarBatches
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
def
readPartFile(blocks: Seq[BlockMetaData], clippedSchema: MessageType, filePath: Path): (HostMemoryBuffer, Long, Seq[BlockMetaData])
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
implicit
def
toBlockMetaData(block: DataBlockBase): BlockMetaData
Conversions used by the multithreaded reader and the coalescing reader.
- Definition Classes
- ParquetPartitionReaderBase
-
implicit
def
toBlockMetaDataSeq(blocks: Seq[DataBlockBase]): Seq[BlockMetaData]
- Definition Classes
- ParquetPartitionReaderBase
-
def
toCudfColumnNames(readDataSchema: StructType, fileSchema: MessageType, isCaseSensitive: Boolean, useFieldId: Boolean): Seq[String]
Takes case sensitivity into consideration when resolving the column names to read before sending the parquet-formatted buffer to cudf. Also clips the column names if useFieldId is true.
- readDataSchema
the Spark schema to read
- fileSchema
the schema of the dumped parquet-formatted buffer, with unmatched fields already removed
- isCaseSensitive
whether matching is case sensitive
- useFieldId
whether spark.sql.parquet.fieldId.read.enabled is set
- returns
a sequence of column names following the order of readDataSchema
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
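A minimal sketch of the case-insensitive name resolution described above (resolveNames is a hypothetical helper; the real method also honors useFieldId):

```scala
// Map each requested read-schema name to the matching file-schema name,
// matching case-insensitively when requested.
def resolveNames(readNames: Seq[String], fileNames: Seq[String],
    caseSensitive: Boolean): Seq[String] = {
  if (caseSensitive) {
    readNames.filter(fileNames.contains)
  } else {
    val byLower = fileNames.map(n => n.toLowerCase -> n).toMap
    readNames.flatMap(n => byLower.get(n.toLowerCase))
  }
}
```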
-
implicit
def
toDataBlockBase(blocks: Seq[BlockMetaData]): Seq[DataBlockBase]
- Definition Classes
- ParquetPartitionReaderBase
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
writeFooter(out: OutputStream, blocks: Seq[BlockMetaData], schema: MessageType): Unit
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase