class InitialSnapshot extends Snapshot
An initial snapshot with only metadata specified. Useful for creating a DataFrame from an existing Parquet table during its conversion to Delta.
- By Inheritance
- InitialSnapshot
- Snapshot
- DataSkippingReader
- DataSkippingReaderBase
- ReadsMetadataFields
- DeltaScanGenerator
- DeltaScanGeneratorBase
- StatisticsCollection
- UsesMetadataFields
- StateCache
- DeltaLogging
- DatabricksLogging
- DeltaProgressReporter
- Logging
- AnyRef
- Any
Instance Constructors
Type Members
- class CachedDS[A] extends AnyRef
- Definition Classes
- StateCache
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final val MAX: String("maxValues")
- Definition Classes
- UsesMetadataFields
- final val MIN: String("minValues")
- Definition Classes
- UsesMetadataFields
- final val NULL_COUNT: String("nullCount")
- Definition Classes
- UsesMetadataFields
- final val NUM_RECORDS: String("numRecords")
- Definition Classes
- UsesMetadataFields
- def aggregationsToComputeState: Map[String, Column]
A Map of alias to the aggregations which need to be done to calculate the computedState.
- Attributes
- protected
- Definition Classes
- Snapshot
- def allFiles: Dataset[AddFile]
All of the files present in this Snapshot.
- Definition Classes
- Snapshot → DataSkippingReaderBase
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def cacheDS[A](ds: Dataset[A], name: String): CachedDS[A]
Create a CachedDS instance for the given Dataset and name.
- Definition Classes
- StateCache
- lazy val checkpointFileIndexOpt: Option[DeltaLogFileIndex]
- Attributes
- protected
- Definition Classes
- Snapshot
- def checkpointSizeInBytes(): Long
- Definition Classes
- Snapshot
- val checksumOpt: Option[VersionChecksum]
- Definition Classes
- Snapshot
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- val columnMappingMode: DeltaColumnMappingMode
- Definition Classes
- DataSkippingReaderBase
- def computeChecksum: VersionChecksum
Computes all the information that is needed by the checksum for the current snapshot. May kick off state reconstruction if needed by any of the underlying fields. Note that it's safe to set txnId to none, since the snapshot doesn't always have a txn attached. E.g. if a snapshot is created by reading a checkpoint, then no txnId is present.
- Definition Classes
- Snapshot
- lazy val computedState: State
Computes some statistics over the transaction log, i.e. over the actions made on this Delta table.
- Attributes
- protected
- Definition Classes
- InitialSnapshot → Snapshot
- def constructPartitionFilters(filters: Seq[Expression]): Column
Given the partition filters on the data, rewrite these filters by pointing to the metadata columns.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def dataSchema: StructType
Returns the data schema of the table, i.e. the schema of the columns written out to data files.
- Definition Classes
- Snapshot → StatisticsCollection
- lazy val deltaFileIndexOpt: Option[DeltaLogFileIndex]
- Attributes
- protected
- Definition Classes
- Snapshot
- def deltaFileSizeInBytes(): Long
- Definition Classes
- Snapshot
- val deltaLog: DeltaLog
- Definition Classes
- InitialSnapshot → Snapshot → DataSkippingReaderBase
- def emptyDF: DataFrame
- Attributes
- protected
- Definition Classes
- Snapshot
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- lazy val fileIndices: Seq[DeltaLogFileIndex]
- Attributes
- protected
- Definition Classes
- Snapshot
- def fileSizeHistogram: Option[FileSizeHistogram]
- Definition Classes
- Snapshot
- def filesForScan(projection: Seq[Attribute], filters: Seq[Expression], keepNumRecords: Boolean): DeltaScan
- Definition Classes
- DataSkippingReaderBase
- def filesForScan(projection: Seq[Attribute], filters: Seq[Expression]): DeltaScan
Gathers files that should be included in a scan based on the given predicates. Statistics about the amount of data that will be read are gathered and returned.
- Definition Classes
- DataSkippingReaderBase → DeltaScanGeneratorBase
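As a hedged sketch, a caller holding a Snapshot could use filesForScan for file pruning roughly like this (the snapshot value, the partition attribute, and the literal are hypothetical, not part of the documented API beyond the signature above):

```scala
// Sketch only: assumes `snapshot` was obtained elsewhere, e.g. via
// DeltaLog.forTable(spark, tablePath).snapshot, and `dateAttr` is the
// resolved Attribute of a partition column named "date" (both hypothetical).
import org.apache.spark.sql.catalyst.expressions.{Attribute, EqualTo, Literal}

def candidateFileCount(snapshot: Snapshot, dateAttr: Attribute): Int = {
  val filters = Seq(EqualTo(dateAttr, Literal("2021-01-01")))
  val scan = snapshot.filesForScan(projection = Nil, filters = filters)
  scan.files.length // files surviving partition pruning and data skipping
}
```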
- def filesWithStatsForScan(partitionFilters: Seq[Expression]): DataFrame
Returns a DataFrame for the given partition filters. The schema of the returned DataFrame is nearly the same as AddFile, except that the stats field is parsed into a struct from a JSON string.
- Definition Classes
- DataSkippingReaderBase → DeltaScanGeneratorBase
- def filterOnPartitions(partitionFilters: Seq[Expression], keepNumRecords: Boolean): (Seq[AddFile], DataSize)
Get all the files in this table given the partition filter, and the corresponding size of the scan.
- keepNumRecords
Also select stats.numRecords in the query. This may slow down the query as it has to parse JSON.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- def getAllFiles(keepNumRecords: Boolean): Seq[AddFile]
Get all the files in this table.
- keepNumRecords
Also select stats.numRecords in the query. This may slow down the query as it has to parse JSON.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def getBaseStatsColumn: Column
Returns a Column that references the stats field data skipping should use.
- Definition Classes
- ReadsMetadataFields
- def getCheckpointMetadataOpt: Option[CheckpointMetaData]
- Definition Classes
- Snapshot
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getDataSkippedFiles(partitionFilters: Column, dataFilters: DataSkippingPredicate, keepNumRecords: Boolean): (Seq[AddFile], Seq[DataSize])
Given the partition and data filters, leverage data skipping statistics to find the set of files that need to be queried. Returns a tuple of the files and, optionally, the sizes of the scan that would be generated with no filters, with only partition filters, and with the combined effect of partition and data filters, respectively.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def getNumPartitions: Int
- Attributes
- protected
- Definition Classes
- Snapshot
- def getProperties: HashMap[String, String]
Return the set of properties of the table.
- Definition Classes
- Snapshot
- def getSpecificFilesWithStats(paths: Seq[String]): Seq[AddFile]
Get AddFile (with stats) actions corresponding to the given set of paths in the Snapshot. If a path doesn't exist in the snapshot, it is ignored and no AddFile is returned for it.
- paths
Sequence of paths for which we want to get AddFile actions
- returns
a sequence of AddFiles for the given paths
- Definition Classes
- DataSkippingReaderBase
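A minimal sketch of looking up stats-bearing AddFile actions by path (the snapshot value and the path strings are placeholders):

```scala
// Sketch only: `snapshot` is assumed to come from DeltaLog.forTable(...).snapshot.
// Paths not present in the snapshot are silently ignored.
val paths = Seq("part-00000.snappy.parquet", "part-00001.snappy.parquet") // placeholder names
val addFiles: Seq[AddFile] = snapshot.getSpecificFilesWithStats(paths)
addFiles.foreach(f => println(s"${f.path}: ${f.size} bytes"))
```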
- final def getStatsColumnOpt(stat: StatsColumn): Option[Column]
Overload for convenience when working with StatsColumn helpers.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- final def getStatsColumnOpt(statType: String, pathToColumn: Seq[String] = Nil): Option[Column]
Returns an expression to access the given statistics for a specific column, or None if that stats column does not exist.
- statType
One of the fields declared by trait UsesMetadataFields
- pathToColumn
The components of the nested column name to get stats for.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- final def getStatsColumnOrNullLiteral(stat: StatsColumn): Column
Overload for convenience when working with StatsColumn helpers.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- final def getStatsColumnOrNullLiteral(statType: String, pathToColumn: Seq[String] = Nil): Column
Returns an expression to access the given statistics for a specific column, or a NULL literal expression if that column does not exist.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def indexToRelation(index: DeltaLogFileIndex, schema: StructType = logSchema): LogicalRelation
Creates a LogicalRelation with the given schema from a DeltaLogFileIndex.
- Attributes
- protected
- Definition Classes
- Snapshot
- def init(): Unit
Performs validations during initialization.
- Attributes
- protected
- Definition Classes
- Snapshot
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def loadActions: DataFrame
Loads the file indices into a DataFrame that can be used for LogReplay.
In addition to the usual nested columns provided by the SingleAction schema, it should provide two additional columns to simplify the log replay process: ACTION_SORT_COL_NAME (which, when sorted in ascending order, will order older actions before newer ones, as required by InMemoryLogReplay); and ADD_STATS_TO_USE_COL_NAME (to handle certain combinations of config settings for delta.checkpoint.writeStatsAsJson and delta.checkpoint.writeStatsAsStruct).
- Attributes
- protected
- Definition Classes
- Snapshot
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Definition Classes
- Snapshot → Logging
- def logError(msg: => String): Unit
- Definition Classes
- Snapshot → Logging
- def logInfo(msg: => String): Unit
- Definition Classes
- Snapshot → Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logMissingActionWarning(action: String): Unit
Helper method to log missing actions when state reconstruction checks are not enabled.
- Attributes
- protected
- Definition Classes
- Snapshot
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- val logPath: Path
- val logSegment: LogSegment
- Definition Classes
- Snapshot
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Definition Classes
- Snapshot → Logging
- def logWarning(msg: => String): Unit
- Definition Classes
- Snapshot → Logging
- val metadata: Metadata
- Definition Classes
- InitialSnapshot → Snapshot → DataSkippingReaderBase
- val minFileRetentionTimestamp: Long
- Definition Classes
- Snapshot
- val minSetTransactionRetentionTimestamp: Option[Long]
- Definition Classes
- Snapshot
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- lazy val numIndexedCols: Int
Number of columns to collect stats on for data skipping.
- Definition Classes
- Snapshot → StatisticsCollection
- def numOfFiles: Long
- Definition Classes
- Snapshot → DataSkippingReaderBase
- def numOfMetadata: Long
- Definition Classes
- Snapshot
- def numOfProtocol: Long
- Definition Classes
- Snapshot
- def numOfRemoves: Long
- Definition Classes
- Snapshot
- def numOfSetTransactions: Long
- Definition Classes
- Snapshot
- val path: Path
- Definition Classes
- Snapshot → DataSkippingReaderBase
- def protocol: Protocol
- Definition Classes
- Snapshot
- def recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or report detailed, operation-specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordFrameProfile[T](group: String, name: String)(thunk: => T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = null, silent: Boolean = true)(thunk: => S): S
- Definition Classes
- DatabricksLogging
- def recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- def recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- def redactedPath: String
- Definition Classes
- Snapshot → DataSkippingReaderBase
- def schema: StructType
Returns the schema of the table.
- Definition Classes
- Snapshot → DataSkippingReaderBase
- def setTransactions: Seq[SetTransaction]
- Definition Classes
- Snapshot
- def sizeInBytes: Long
- Definition Classes
- Snapshot → DataSkippingReaderBase
- val snapshotToScan: Snapshot
Snapshot to scan by the DeltaScanGenerator for metadata query optimizations.
- Definition Classes
- Snapshot → DeltaScanGeneratorBase
- def spark: SparkSession
- Attributes
- protected
- Definition Classes
- Snapshot → StatisticsCollection → StateCache
- lazy val statCollectionSchema: StructType
statCollectionSchema is the schema composed of all the columns that have stats collected under the current table configuration.
- Definition Classes
- StatisticsCollection
- def stateDF: DataFrame
The current set of actions in this Snapshot as plain Rows.
- Definition Classes
- InitialSnapshot → Snapshot
- def stateDS: Dataset[SingleAction]
The current set of actions in this Snapshot as a typed Dataset.
- Definition Classes
- InitialSnapshot → Snapshot
- lazy val statsCollector: Column
Returns a struct column that can be used to collect statistics for the current schema of the table. The types we keep stats on must be consistent with DataSkippingReader.SkippingEligibleLiteral.
- Definition Classes
- StatisticsCollection
- lazy val statsSchema: StructType
Returns the schema of the collected statistics.
- Definition Classes
- StatisticsCollection
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- val timestamp: Long
- Definition Classes
- Snapshot
- def toString(): String
- Definition Classes
- Snapshot → AnyRef → Any
- def tombstones: Dataset[RemoveFile]
All unexpired tombstones.
- Definition Classes
- Snapshot
- lazy val transactions: Map[String, Long]
A map to look up transaction version by appId.
- Definition Classes
- Snapshot
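As a sketch, the transactions map supports idempotent-write checks keyed by application id (the appId string and batch version here are hypothetical):

```scala
// Sketch only: look up the last committed transaction version for a given appId,
// e.g. to decide whether a streaming batch was already written.
val appId = "my-streaming-app" // hypothetical application id
val lastVersion: Option[Long] = snapshot.transactions.get(appId)
val alreadyCommitted = lastVersion.exists(_ >= 42L) // 42L: hypothetical batch version
```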
- def uncache(): Unit
Drop any cached data for this Snapshot.
- Definition Classes
- StateCache
- def verifyStatsForFilter(referencedStats: Set[StatsColumn]): Column
Returns an expression that can be used to check that the required statistics are present for a given file. If any required statistics are missing we must include the corresponding file.
NOTE: We intentionally choose to disable skipping for any file if any required stat is missing, because doing it that way allows us to check each stat only once (rather than once per use). Checking per-use would anyway only help for tables where the number of indexed columns has changed over time, producing add.stats_parsed records with differing schemas. That should be a rare enough case to not worry about optimizing for, given that the fix requires more complex skipping predicates that would penalize the common case.
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- val version: Long
- Definition Classes
- Snapshot → DataSkippingReaderBase
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def withDmqTag[T](thunk: => T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def withNoStats: DataFrame
All files with the statistics column dropped completely.
- Definition Classes
- DataSkippingReaderBase
- final def withStats: DataFrame
Returns a parsed and cached representation of files with statistics.
- returns
cached DataFrame
- Definition Classes
- DataSkippingReaderBase
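A hedged sketch of inspecting the parsed statistics through withStats (the nested column names selected here assume stats were actually collected for the table):

```scala
// Sketch only: `snapshot.withStats` exposes AddFile rows with the JSON stats
// parsed into a struct, so nested stat fields can be selected directly.
snapshot.withStats
  .select("path", "stats.numRecords", "stats.minValues", "stats.maxValues")
  .show(truncate = false)
```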
- def withStatsInternal: DataFrame
- Attributes
- protected
- Definition Classes
- DataSkippingReaderBase
- def withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: => T): T
Report a log message to indicate that some command is running.
- Definition Classes
- DeltaProgressReporter