Packages

package stats

Type Members

  1. class ArrayAccumulator extends AccumulatorV2[(Int, Long), Array[Long]]

    An accumulator that keeps arrays of counts. Counts from multiple partitions are merged by index. A count of -1 indicates a null and is handled using three-valued logic (TVL): -1 + N = -1.
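
    The merge rule described above can be sketched in plain Scala without Spark. The object and method names below are hypothetical and are not part of ArrayAccumulator; the sketch only assumes that both arrays have the same length and that -1 is the "unknown" sentinel.

```scala
object ArrayCountMergeSketch {
  // Sentinel meaning "count unknown" (null), as described above.
  val Unknown: Long = -1L

  // Adds two counts; an unknown operand makes the result unknown (-1 + N = -1).
  def addCounts(a: Long, b: Long): Long =
    if (a == Unknown || b == Unknown) Unknown else a + b

  // Merges two per-partition count arrays index by index.
  // Assumption: both arrays have the same length.
  def merge(left: Array[Long], right: Array[Long]): Array[Long] = {
    require(left.length == right.length, "count arrays must have the same length")
    left.zip(right).map { case (a, b) => addCounts(a, b) }
  }
}
```

    Once an index becomes -1 in any partition, it stays -1 in the merged result, no matter how many known counts are added later.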

  2. case class DataSize(bytesCompressed: Option[Long] = None, rows: Option[Long] = None, files: Option[Long] = None) extends Product with Serializable

    DataSize describes the following attributes of data that consists of a list of input files:

    bytesCompressed

    total size of the data

    rows

    number of rows in the data

    files

    number of input files

    Note: Please don't add any new constructor to this class. jackson-module-scala always picks up the first constructor returned by Class.getConstructors, but the order of the constructors list is non-deterministic. (SC-13343)
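
    Because every field defaults to None, partial values can be built with named arguments and copy rather than with extra constructors. The sketch below uses a local stand-in that copies only the signature shown above; it is not the Delta class, and the sample values are made up.

```scala
object DataSizeSketch {
  // Local stand-in with the same fields and defaults as the signature above.
  final case class DataSize(
      bytesCompressed: Option[Long] = None,
      rows: Option[Long] = None,
      files: Option[Long] = None)

  // Only the file count is known here; the other fields stay None.
  val filesOnly: DataSize = DataSize(files = Some(3L))

  // copy fills in another attribute later without needing a new constructor.
  val withRows: DataSize = filesOnly.copy(rows = Some(1000L))
}
```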

  3. trait DataSkippingReader extends DataSkippingReaderBase
  4. trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging

    Adds to an org.apache.spark.sql.delta.Snapshot of a given Delta table the ability to use statistics to filter the set of files based on predicates.
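
    As a rough illustration of statistics-based file filtering, the sketch below prunes files for a predicate of the form "column > threshold" using per-file min/max values. FileStats and filesForGreaterThan are hypothetical names invented for this example and do not correspond to the trait's actual members.

```scala
object DataSkippingSketch {
  // Hypothetical per-file metadata: min/max of one long column, absent when
  // no statistics were collected for the file.
  final case class FileStats(path: String, minMax: Option[(Long, Long)])

  // Keeps every file that might contain a row with value > threshold.
  // A file is skipped only when its stats prove that no row can match;
  // files without stats are always kept.
  def filesForGreaterThan(files: Seq[FileStats], threshold: Long): Seq[FileStats] =
    files.filter {
      case FileStats(_, Some((_, max))) => max > threshold
      case FileStats(_, None)           => true
    }
}
```

    Skipping is conservative by design: a file with missing statistics cannot be ruled out, so it remains in the scan.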

  5. case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats with Product with Serializable

    A WriteTaskStats that contains a map from file name to the JSON representation of the collected statistics.

  6. class DeltaJobStatisticsTracker extends WriteJobStatsTracker

    Serializable factory class that holds together all required parameters for being able to instantiate a DeltaTaskStatisticsTracker on an executor.

  7. case class DeltaScan(version: Long, files: Seq[AddFile], total: DataSize, partition: DataSize, scanned: DataSize)(scannedSnapshot: Snapshot, partitionFilters: ExpressionSet, dataFilters: ExpressionSet, unusedFilters: ExpressionSet, projection: AttributeSet, scanDurationMs: Long, dataSkippingType: DeltaDataSkippingType) extends Product with Serializable

    Used to hold details about the files and stats for a scan where we have already applied filters and a limit.

  8. trait DeltaScanGenerator extends DeltaScanGeneratorBase
  9. trait DeltaScanGeneratorBase extends AnyRef

    Trait representing a class that can generate DeltaScan given filters, etc.

  10. class DeltaTaskStatisticsTracker extends WriteTaskStatsTracker

    A per-task (i.e. one instance per executor) WriteTaskStatsTracker that collects the statistics defined by StatisticsCollection for files that are being written into a delta table.

  11. case class FilterMetric(numFiles: Long, predicates: Seq[QueryPredicateReport]) extends Product with Serializable

    Used to report details about pre-query filtering of the data that is scanned.

  12. class PrepareDeltaScan extends Rule[LogicalPlan] with PrepareDeltaScanBase
  13. trait PrepareDeltaScanBase extends Rule[LogicalPlan] with PredicateHelper with DeltaLogging

    Before query planning, we prepare any scans over Delta tables by pushing in any projections or filters, which allows us to gather more accurate statistics for CBO and metering.

    Note the following:
      - This rule also ensures that all reads from the same Delta log use the same snapshot of the log, thus providing snapshot isolation.
      - If this rule is invoked within an active OptimisticTransaction, then the scans are generated using the transaction.
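
    The snapshot-isolation guarantee can be pictured as pinning one snapshot per log for the duration of a query. The following is a minimal sketch of that idea only; Snapshot here is a local stand-in, and ScanPreparer and latestVersion are hypothetical names, not the rule's real implementation.

```scala
import scala.collection.mutable

object SnapshotPinningSketch {
  // Hypothetical stand-in for a table snapshot at a given version.
  final case class Snapshot(logPath: String, version: Long)

  // One instance per query: remembers the first snapshot resolved for each log.
  final class ScanPreparer(latestVersion: String => Long) {
    private val pinned = mutable.Map.empty[String, Snapshot]

    // Every scan of the same log path reuses the snapshot pinned by the first
    // scan, even if the table has advanced in the meantime.
    def snapshotFor(logPath: String): Snapshot =
      pinned.getOrElseUpdate(logPath, Snapshot(logPath, latestVersion(logPath)))
  }
}
```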

  14. case class PreparedDeltaFileIndex(spark: SparkSession, deltaLog: DeltaLog, path: Path, preparedScan: DeltaScan, partitionSchema: StructType, versionScanned: Option[Long]) extends TahoeFileIndex with DeltaLogging with Product with Serializable

    A TahoeFileIndex that uses a prepared scan to return the list of relevant files. This is injected into a query right before query planning by PrepareDeltaScan so that CBO and metering can accurately understand how much data will be read.

    versionScanned

    The version of the table that is being scanned, if a specific version has been explicitly requested, e.g. by time travel.

  15. case class QueryPredicateReport(predicate: String, pruningType: String, filesMissingStats: Long, filesDropped: Long) extends Product with Serializable

    Used to report metrics on how predicates are used to prune the set of files that are read by a query.

    predicate

    A user readable version of the predicate.

    pruningType

    One of {partition, dataStats, none}.

    filesMissingStats

    The number of files that were included due to missing statistics.

    filesDropped

    The number of files that were dropped by this predicate.
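
    To make the two counters concrete, the sketch below fills a local mirror of the case class from per-file pruning outcomes. The report helper and the Option[Boolean] encoding of outcomes are assumptions made for illustration; only the field names and the pruningType value "dataStats" come from the description above.

```scala
object PredicateReportSketch {
  // Local mirror of the case class signature above.
  final case class QueryPredicateReport(
      predicate: String,
      pruningType: String,
      filesMissingStats: Long,
      filesDropped: Long)

  // Hypothetical outcome of checking one file against a data predicate:
  // Some(true) = may match, Some(false) = provably no match, None = no stats.
  def report(predicate: String, outcomes: Seq[Option[Boolean]]): QueryPredicateReport =
    QueryPredicateReport(
      predicate = predicate,
      pruningType = "dataStats",
      filesMissingStats = outcomes.count(_.isEmpty).toLong,
      filesDropped = outcomes.count(_.contains(false)).toLong)
}
```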

  16. trait ReadsMetadataFields extends UsesMetadataFields

    A mixin trait that provides access to the stats fields in the transaction log.

  17. trait StatisticsCollection extends UsesMetadataFields with DeltaLogging

    A helper trait that constructs expressions that can be used to collect global and column level statistics for a collection of data, given its schema.

    Global statistics (such as the number of records) are stored as top level columns. Per-column statistics (such as min/max) are stored in a struct that mirrors the schema of the data.

    To illustrate, here is an example of a data schema along with the schema of the statistics that would be collected.

    Data Schema:

    |-- a: struct (nullable = true)
    |    |-- b: struct (nullable = true)
    |    |    |-- c: long (nullable = true)

    Collected Statistics:

    |-- stats: struct (nullable = true)
    |    |-- numRecords: long (nullable = false)
    |    |-- minValues: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
    |    |-- maxValues: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
    |    |-- nullCount: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
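
    The mirroring shown in the example can be sketched with a tiny schema model in plain Scala. The types below are local stand-ins, not Spark's StructType, and the sketch assumes that min/max keep each leaf's own type while null counts are always long, which is consistent with the example above where c is a long.

```scala
object StatsSchemaSketch {
  // Minimal stand-ins for a data schema: a leaf type or a struct of named fields.
  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fields: List[(String, DataType)]) extends DataType

  // Mirrors the data schema: structs stay structs, each leaf is mapped by `leaf`.
  def mirror(schema: DataType, leaf: DataType => DataType): DataType = schema match {
    case StructType(fields) => StructType(fields.map { case (n, t) => n -> mirror(t, leaf) })
    case other              => leaf(other)
  }

  // Global statistics sit at the top level next to the per-column structs.
  def statsSchema(data: StructType): StructType = StructType(List(
    "numRecords" -> LongType,
    "minValues"  -> mirror(data, identity),      // assumption: same type as the column
    "maxValues"  -> mirror(data, identity),      // assumption: same type as the column
    "nullCount"  -> mirror(data, _ => LongType)  // assumption: counts are long
  ))

  // Lists the dotted path of every leaf, e.g. "minValues.a.b.c".
  def leafPaths(t: DataType, prefix: String = ""): List[String] = t match {
    case StructType(fields) =>
      fields.flatMap { case (n, c) => leafPaths(c, if (prefix.isEmpty) n else s"$prefix.$n") }
    case _ => List(prefix)
  }
}
```

    For the data schema a.b.c above, leafPaths(statsSchema(...)) lists numRecords followed by minValues.a.b.c, maxValues.a.b.c and nullCount.a.b.c.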

  18. trait UsesMetadataFields extends AnyRef

    A mixin trait for all interfaces that would like to use information stored in Delta's transaction log.

Value Members

  1. object DataSize extends Serializable
  2. object DeltaDataSkippingType extends Enumeration
  3. object PrepareDeltaScanBase
  4. object SkippingEligibleColumn

    An extractor that matches on access of a skipping-eligible column. We only collect stats for leaf columns, so internal columns of nested types are ineligible for skipping.

    NOTE: This check is sufficient for safe use of NULL_COUNT stats, but safe use of MIN and MAX stats requires additional restrictions on column data type (see SkippingEligibleLiteral).

    returns

    The path to the column and the column's data type if it exists and is eligible. Otherwise, return None.
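
    The extractor pattern described here can be sketched on a toy expression model. Everything below (the Expr and DataType stand-ins, resolve, EligibleColumn) is hypothetical and independent of Spark's Catalyst classes; it only illustrates matching leaf columns and rejecting internal struct columns.

```scala
object SkippingColumnSketch {
  // Minimal stand-ins for data types and column-access expressions.
  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fields: Map[String, DataType]) extends DataType

  sealed trait Expr
  final case class Column(name: String, dataType: DataType) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr

  // Resolves a (possibly nested) column access to its path and data type.
  private def resolve(e: Expr): Option[(Seq[String], DataType)] = e match {
    case Column(name, dt) => Some((Seq(name), dt))
    case GetField(child, field) =>
      resolve(child).flatMap {
        case (path, StructType(fields)) => fields.get(field).map(dt => (path :+ field, dt))
        case _                          => None
      }
  }

  // Extractor: matches only leaf columns, because stats exist only for leaves.
  object EligibleColumn {
    def unapply(e: Expr): Option[(Seq[String], DataType)] =
      resolve(e).filter { case (_, dt) => !dt.isInstanceOf[StructType] }
  }
}
```

    In a pattern match, `case EligibleColumn(path, dataType) => ...` then binds the column path and type only for leaf accesses such as a.b.c, while a.b falls through to the next case.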

  5. object SkippingEligibleLiteral

    An extractor that matches on access of a skipping-eligible Literal. Delta tables track min/max stats for a limited set of data types, and only Literals of those types are skipping-eligible.

    WARNING: This extractor needs to be kept in sync with StatisticsCollection.statsCollector.

    returns

    The Literal, if it is eligible. Otherwise, return None.
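
    A type-restricted literal extractor can be sketched the same way. The whitelist of Int, Long and String below is purely illustrative: this page does not list the data types for which Delta tracks min/max stats, and the Expr model and names are hypothetical.

```scala
object SkippingLiteralSketch {
  // Minimal stand-ins for expressions.
  sealed trait Expr
  final case class Literal(value: Any) extends Expr
  final case class Column(name: String) extends Expr

  // Extractor: matches literals whose runtime type is in an illustrative
  // whitelist (assumption: Int, Long, String; not the real Delta list).
  object EligibleLiteral {
    def unapply(e: Expr): Option[Literal] = e match {
      case lit @ Literal(_: Int | _: Long | _: String) => Some(lit)
      case _                                           => None
    }
  }

  // Example use of the extractor in a pattern match.
  def describe(e: Expr): String = e match {
    case EligibleLiteral(Literal(v)) => s"eligible literal $v"
    case _                           => "not eligible"
  }
}
```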

  6. object StatisticsCollection extends DeltaCommand
