package stats
Type Members
- class ArrayAccumulator extends AccumulatorV2[(Int, Long), Array[Long]]
An accumulator that keeps arrays of counts. Counts from multiple partitions are merged by index. -1 indicates a null and is handled using three-valued logic (TVL): -1 + N = -1.
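The merge-by-index semantics can be sketched as a standalone function (a minimal illustration only; the real `ArrayAccumulator` extends Spark's `AccumulatorV2` and the function name here is hypothetical):

```scala
// Sketch of merging two count arrays by index, where -1 represents null
// and absorbs any addend (the TVL rule: -1 + N = -1).
def mergeCounts(a: Array[Long], b: Array[Long]): Array[Long] =
  a.zip(b).map { case (x, y) =>
    if (x == -1L || y == -1L) -1L // a null in either partition stays null
    else x + y                    // otherwise counts add normally
  }
```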
- case class DataSize(bytesCompressed: Option[Long] = None, rows: Option[Long] = None, files: Option[Long] = None) extends Product with Serializable
DataSize describes the following attributes for data that consists of a list of input files:
- bytesCompressed
total size of the data
- rows
number of rows in the data
- files
number of input files
Note: Please don't add any new constructor to this class. jackson-module-scala always picks up the first constructor returned by Class.getConstructors, but the order of the constructors list is non-deterministic. (SC-13343)
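Since every attribute is optional, the case class can describe fully known or entirely unknown sizes. A minimal sketch (the case class is re-declared locally here for illustration; field names and defaults match the signature above):

```scala
// Local re-declaration mirroring the DataSize signature above.
case class DataSize(
    bytesCompressed: Option[Long] = None, // total size of the data
    rows: Option[Long] = None,            // number of rows in the data
    files: Option[Long] = None)           // number of input files

// Fully known size vs. an unknown size (all attributes absent).
val total = DataSize(bytesCompressed = Some(1024L), rows = Some(10L), files = Some(2L))
val unknown = DataSize()
```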
- trait DataSkippingReader extends DataSkippingReaderBase
- trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging
Adds the ability to use statistics to filter the set of files based on predicates to a org.apache.spark.sql.delta.Snapshot of a given Delta table.
- case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats with Product with Serializable
A WriteTaskStats that contains a map from file name to the json representation of the collected statistics.
- class DeltaJobStatisticsTracker extends WriteJobStatsTracker
Serializable factory class that holds together all required parameters for being able to instantiate a DeltaTaskStatisticsTracker on an executor.
- case class DeltaScan(version: Long, files: Seq[AddFile], total: DataSize, partition: DataSize, scanned: DataSize)(scannedSnapshot: Snapshot, partitionFilters: ExpressionSet, dataFilters: ExpressionSet, unusedFilters: ExpressionSet, projection: AttributeSet, scanDurationMs: Long, dataSkippingType: DeltaDataSkippingType) extends Product with Serializable
Used to hold details about the files and stats for a scan where we have already applied filters and a limit.
- trait DeltaScanGenerator extends DeltaScanGeneratorBase
- trait DeltaScanGeneratorBase extends AnyRef
Trait representing a class that can generate DeltaScan given filters, etc.
- class DeltaTaskStatisticsTracker extends WriteTaskStatsTracker
A per-task (i.e. one instance per executor) WriteTaskStatsTracker that collects the statistics defined by StatisticsCollection for files that are being written into a delta table.
- case class FilterMetric(numFiles: Long, predicates: Seq[QueryPredicateReport]) extends Product with Serializable
Used to report details about pre-query filtering of what data is scanned.
- class PrepareDeltaScan extends Rule[LogicalPlan] with PrepareDeltaScanBase
- trait PrepareDeltaScanBase extends Rule[LogicalPlan] with PredicateHelper with DeltaLogging
Before query planning, we prepare any scans over delta tables by pushing in any projections or filters, allowing us to gather more accurate statistics for CBO and metering.
Note the following:
- This rule also ensures that all reads from the same delta log use the same snapshot of the log, thus providing snapshot isolation.
- If this rule is invoked within an active OptimisticTransaction, then the scans are generated using the transaction.
- case class PreparedDeltaFileIndex(spark: SparkSession, deltaLog: DeltaLog, path: Path, preparedScan: DeltaScan, partitionSchema: StructType, versionScanned: Option[Long]) extends TahoeFileIndex with DeltaLogging with Product with Serializable
A TahoeFileIndex that uses a prepared scan to return the list of relevant files. This is injected into a query right before query planning by PrepareDeltaScan so that CBO and metering can accurately understand how much data will be read.
- versionScanned
The version of the table that is being scanned, if a specific version has been requested, e.g. by time travel.
- case class QueryPredicateReport(predicate: String, pruningType: String, filesMissingStats: Long, filesDropped: Long) extends Product with Serializable
Used to report metrics on how predicates are used to prune the set of files that are read by a query.
- predicate
A user readable version of the predicate.
- pruningType
One of {partition, dataStats, none}.
- filesMissingStats
The number of files that were included due to missing statistics.
- filesDropped
The number of files that were dropped by this predicate.
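As a rough illustration of how these reporting case classes fit together (shapes taken from the signatures above; both classes are re-declared locally and all values are made up), a FilterMetric aggregates one QueryPredicateReport per predicate:

```scala
// Local re-declarations mirroring the signatures above, for illustration only.
case class QueryPredicateReport(
    predicate: String,        // user-readable predicate
    pruningType: String,      // one of {partition, dataStats, none}
    filesMissingStats: Long,  // files kept because stats were missing
    filesDropped: Long)       // files pruned by this predicate

case class FilterMetric(numFiles: Long, predicates: Seq[QueryPredicateReport])

// Hypothetical values: one partition predicate and one data-stats predicate.
val reports = Seq(
  QueryPredicateReport("date = '2020-01-01'", "partition", filesMissingStats = 0L, filesDropped = 90L),
  QueryPredicateReport("x > 5", "dataStats", filesMissingStats = 2L, filesDropped = 40L))
val metric = FilterMetric(numFiles = 150L, predicates = reports)
```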
- trait ReadsMetadataFields extends UsesMetadataFields
A mixin trait that provides access to the stats fields in the transaction log.
- trait StatisticsCollection extends UsesMetadataFields with DeltaLogging
A helper trait that constructs expressions that can be used to collect global and column level statistics for a collection of data, given its schema.
Global statistics (such as the number of records) are stored as top level columns. Per-column statistics (such as min/max) are stored in a struct that mirrors the schema of the data.
To illustrate, here is an example of a data schema along with the schema of the statistics that would be collected.
Data Schema:
|-- a: struct (nullable = true)
|    |-- b: struct (nullable = true)
|    |    |-- c: long (nullable = true)
Collected Statistics:
|-- stats: struct (nullable = true)
|    |-- numRecords: long (nullable = false)
|    |-- minValues: struct (nullable = false)
|    |    |-- a: struct (nullable = false)
|    |    |    |-- b: struct (nullable = false)
|    |    |    |    |-- c: long (nullable = true)
|    |-- maxValues: struct (nullable = false)
|    |    |-- a: struct (nullable = false)
|    |    |    |-- b: struct (nullable = false)
|    |    |    |    |-- c: long (nullable = true)
|    |-- nullCount: struct (nullable = false)
|    |    |-- a: struct (nullable = false)
|    |    |    |-- b: struct (nullable = false)
|    |    |    |    |-- c: long (nullable = true)
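The per-column statistics above can be illustrated with a tiny standalone sketch (assumptions: a single long column, nulls modeled as `None`, and hypothetical names `ColStats`/`collect`; the real implementation builds Catalyst expressions over the data schema instead):

```scala
// Per-column stats: min/max ignore nulls, nullCount counts them.
case class ColStats(min: Option[Long], max: Option[Long], nullCount: Long)

// Returns (numRecords, per-column stats) for one column of values.
def collect(values: Seq[Option[Long]]): (Long, ColStats) = {
  val present = values.flatten // drop nulls before min/max
  val stats = ColStats(
    min = if (present.isEmpty) None else Some(present.min),
    max = if (present.isEmpty) None else Some(present.max),
    nullCount = values.count(_.isEmpty))
  (values.size.toLong, stats) // numRecords is a global statistic
}
```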
- trait UsesMetadataFields extends AnyRef
A mixin trait for all interfaces that would like to use information stored in Delta's transaction log.
Value Members
- object DataSize extends Serializable
- object DeltaDataSkippingType extends Enumeration
- object PrepareDeltaScanBase
- object SkippingEligibleColumn
An extractor that matches on access of a skipping-eligible column. We only collect stats for leaf columns, so internal columns of nested types are ineligible for skipping.
NOTE: This check is sufficient for safe use of NULL_COUNT stats, but safe use of MIN and MAX stats requires additional restrictions on column data type (see SkippingEligibleLiteral).
- returns
The path to the column and the column's data type if it exists and is eligible. Otherwise, return None.
- object SkippingEligibleLiteral
An extractor that matches on access of a skipping-eligible Literal. Delta tables track min/max stats for a limited set of data types, and only Literals of those types are skipping-eligible.
WARNING: This extractor needs to be kept in sync with StatisticsCollection.statsCollector.
- returns
The Literal, if it is eligible. Otherwise, return None.
- object StatisticsCollection extends DeltaCommand