object CDCReader extends DeltaLogging
The API for reading change data between two versions of a table.
The basic abstraction here is the CDC type column defined by CDCReader.CDC_TYPE_COLUMN_NAME. When CDC is enabled, our writer will treat this column as a special partition column even though it's not part of the table. Writers should generate a query that has two types of rows in it: the main data in partition CDC_TYPE_NOT_CDC and the CDC data with the appropriate CDC type value.
org.apache.spark.sql.delta.files.DelayedCommitProtocol does special handling for this column, dispatching the main data to its normal location while the CDC data is sent to AddCDCFile entries.
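As a rough illustration of the dispatch described above, the following plain-Scala sketch splits a batch of rows on the CDC type value. It is a model only, not the actual DelayedCommitProtocol logic: the Row case class, the splitByCdcType helper, the use of None to stand in for CDC_TYPE_NOT_CDC, and the CDC type strings in main are all assumptions made for illustration.

```scala
object CdcDispatchSketch {
  // Hypothetical row model: `cdcType` stands in for the value of the CDC type column.
  // None plays the role of CDC_TYPE_NOT_CDC (an assumption of this sketch).
  final case class Row(payload: String, cdcType: Option[String])

  // Splits rows into (main data, CDC data) based on the CDC type value.
  def splitByCdcType(rows: Seq[Row]): (Seq[Row], Seq[Row]) =
    rows.partition(_.cdcType.isEmpty)

  def main(args: Array[String]): Unit = {
    // The CDC type strings below are illustrative values only.
    val rows = Seq(
      Row("a", None),
      Row("a-old", Some("update_preimage")),
      Row("a-new", Some("update_postimage")),
      Row("b", Some("delete"))
    )
    val (mainData, cdcData) = splitByCdcType(rows)
    println(s"main=${mainData.size}, cdc=${cdcData.size}")
  }
}
```

In the real writer the main data would go to the table's normal location and the CDC rows would end up in files tracked by AddCDCFile entries; the sketch only shows the partitioning decision.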
Type Members
- case class CDCDataSpec[T <: FileAction](version: Long, timestamp: Timestamp, actions: Seq[T]) extends Product with Serializable
- case class CDCVersionDiffInfo(fileChangeDf: DataFrame, numFiles: Long, numBytes: Long) extends Product with Serializable
Represents the changes between some start and end version of a Delta table
- fileChangeDf
contains all of the file changes (AddFile, RemoveFile, AddCDCFile)
- numFiles
the number of AddFile + RemoveFile + AddCDCFiles that are in the df
- numBytes
the total size of the AddFile + RemoveFile + AddCDCFiles that are in the df
- case class DeltaCDFRelation(schema: StructType, sqlContext: SQLContext, deltaLog: DeltaLog, startingVersion: Option[Long], endingVersion: Option[Long]) extends BaseRelation with PrunedFilteredScan with Product with Serializable
A special BaseRelation wrapper for CDF reads.
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val CDC_COLUMNS_IN_DATA: Seq[String]
- val CDC_COMMIT_TIMESTAMP: String
- val CDC_COMMIT_VERSION: String
- val CDC_LOCATION: String
- val CDC_PARTITION_COL: String
- val CDC_TYPE_COLUMN_NAME: String
- val CDC_TYPE_DELETE: String
- val CDC_TYPE_INSERT: String
- val CDC_TYPE_NOT_CDC: String
- val CDC_TYPE_UPDATE_POSTIMAGE: String
- val CDC_TYPE_UPDATE_PREIMAGE: String
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def cdcReadSchema(deltaSchema: StructType): StructType
Append CDC metadata columns to the provided schema.
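A minimal sketch of what appending CDC metadata columns could look like, using a plain-Scala stand-in for a schema rather than Spark's StructType. The Field case class is hypothetical, and the three column names and types are assumed here to match the columns exposed by change data feed reads; the authoritative values are the CDCReader constants.

```scala
object CdcReadSchemaSketch {
  // Hypothetical stand-in for a schema field (the real method works on StructType).
  final case class Field(name: String, dataType: String)

  // Assumed CDC metadata columns: change type, commit version, commit timestamp.
  val cdcMetadataFields: Seq[Field] = Seq(
    Field("_change_type", "string"),
    Field("_commit_version", "long"),
    Field("_commit_timestamp", "timestamp")
  )

  // Returns the table schema with the CDC metadata columns appended at the end.
  def cdcReadSchema(tableSchema: Seq[Field]): Seq[Field] =
    tableSchema ++ cdcMetadataFields
}
```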
- def changesToBatchDF(deltaLog: DeltaLog, start: Long, end: Long, spark: SparkSession): DataFrame
Get the block of change data from start to end Delta log versions (both sides inclusive). The returned DataFrame has isStreaming set to false.
- def changesToDF(deltaLog: DeltaLog, start: Long, end: Long, changes: Iterator[(Long, Seq[Action])], spark: SparkSession, isStreaming: Boolean = false): CDCVersionDiffInfo
For a sequence of changes (AddFile, RemoveFile, AddCDCFile), create a DataFrame that represents the captured change data between start and end inclusive.
Builds the DataFrame using the following logic. For each change of type (Long, Seq[Action]) in changes, iterates over the actions and handles two cases:
- If there are any CDC actions, then we ignore the AddFile and RemoveFile actions in that version and create an AddCDCFile instead.
- If there are no CDC actions, then we must infer the CDC data from the AddFile and RemoveFile actions, taking only those with dataChange = true.
These buffers of AddFile, RemoveFile, and AddCDCFile actions are then used to create corresponding FileIndexes (e.g. TahoeChangeFileIndex), where each is suited to use the given action type to read CDC data. These FileIndexes are then unioned to produce the final DataFrame.
- deltaLog
- DeltaLog for the table for which we are creating a cdc dataFrame
- start
- startingVersion of the changes
- end
- endingVersion of the changes
- changes
- changes is an iterator of all FileActions for a particular commit version.
- spark
- SparkSession
- isStreaming
- indicates whether the DataFrame returned is a streaming DataFrame
- returns
A CDCVersionDiffInfo containing the DataFrame of the changes as well as statistics about the changes
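The per-version buffering described for changesToDF can be sketched in plain Scala as follows. This is a simplified model under stated assumptions, not the actual implementation: the action case classes are hypothetical stand-ins with only the fields needed here, and the Buffers class merely approximates the three action buffers and the numFiles/numBytes statistics. The real method goes on to build file indexes and a DataFrame from these buffers.

```scala
object ChangesToDfSketch {
  // Hypothetical, simplified stand-ins for Delta's action classes.
  sealed trait Action
  final case class AddFile(path: String, size: Long, dataChange: Boolean) extends Action
  final case class RemoveFile(path: String, size: Long, dataChange: Boolean) extends Action
  final case class AddCDCFile(path: String, size: Long) extends Action
  final case class CommitInfo(operation: String) extends Action // non-file action, ignored

  // The three buffers, each tagged with the commit version the action came from.
  final case class Buffers(
      adds: Seq[(Long, AddFile)],
      removes: Seq[(Long, RemoveFile)],
      cdcFiles: Seq[(Long, AddCDCFile)]) {
    def numFiles: Long = (adds.size + removes.size + cdcFiles.size).toLong
    def numBytes: Long =
      adds.map(_._2.size).sum + removes.map(_._2.size).sum + cdcFiles.map(_._2.size).sum
  }

  def bufferChanges(changes: Iterator[(Long, Seq[Action])]): Buffers = {
    val empty = Buffers(Vector.empty, Vector.empty, Vector.empty)
    changes.foldLeft(empty) { case (acc, (version, actions)) =>
      val cdc = actions.collect { case c: AddCDCFile => version -> c }
      if (cdc.nonEmpty) {
        // Case 1: explicit CDC files exist, so AddFile/RemoveFile in this version are ignored.
        acc.copy(cdcFiles = acc.cdcFiles ++ cdc)
      } else {
        // Case 2: infer changes from AddFile/RemoveFile, keeping only dataChange = true.
        val adds = actions.collect { case a: AddFile if a.dataChange => version -> a }
        val removes = actions.collect { case r: RemoveFile if r.dataChange => version -> r }
        acc.copy(adds = acc.adds ++ adds, removes = acc.removes ++ removes)
      }
    }
  }
}
```

Keeping the commit version next to each buffered action is a design choice of this sketch; it reflects that each change row must later be attributed to the version that produced it.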
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- def getCDCRelation(spark: SparkSession, deltaLog: DeltaLog, snapshotToUse: Snapshot, partitionFilters: Seq[Expression], conf: SQLConf, options: CaseInsensitiveStringMap): BaseRelation
Get a Relation that represents change data between two snapshots of the table.
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getTimestampsByVersion(deltaLog: DeltaLog, start: Long, end: Long, spark: SparkSession): Map[Long, Timestamp]
Builds a map from commit versions to associated commit timestamps.
- start
start commit version
- end
end commit version
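A minimal sketch of building such a version-to-timestamp map, assuming the commits are already available as an in-memory sequence and that both start and end are inclusive. The Commit case class and its timestampMillis field are hypothetical; the real method reads commit information from the Delta log.

```scala
import java.sql.Timestamp

object TimestampsByVersionSketch {
  // Hypothetical stand-in for a commit entry in the Delta log.
  final case class Commit(version: Long, timestampMillis: Long)

  // Maps each commit version in [start, end] to its commit timestamp.
  def timestampsByVersion(commits: Seq[Commit], start: Long, end: Long): Map[Long, Timestamp] =
    commits
      .filter(c => c.version >= start && c.version <= end)
      .map(c => c.version -> new Timestamp(c.timestampMillis))
      .toMap
}
```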
- def getVersionForCDC(spark: SparkSession, deltaLog: DeltaLog, conf: SQLConf, options: CaseInsensitiveStringMap, versionKey: String, timestampKey: String): Option[Long]
Given either a version or a timestamp in the read options, returns the version itself or the version corresponding to that timestamp.
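One way to picture this resolution is the sketch below. It works on a plain Map instead of CaseInsensitiveStringMap, and the versionAtTimestamp resolver is a hypothetical function supplied by the caller. That an explicit version takes precedence over a timestamp is an assumption of this sketch, not something the description above states.

```scala
object VersionForCdcSketch {
  // Resolves a version from the options:
  //  - if `versionKey` is present, its value is parsed and returned;
  //  - otherwise, if `timestampKey` is present, the hypothetical resolver maps it to a version;
  //  - otherwise None.
  def versionFor(
      options: Map[String, String],
      versionKey: String,
      timestampKey: String,
      versionAtTimestamp: String => Long): Option[Long] =
    options.get(versionKey).map(_.trim.toLong)
      .orElse(options.get(timestampKey).map(versionAtTimestamp))
}
```

A caller would invoke this once with the keys for the starting bound and once with the keys for the ending bound.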
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def isCDCEnabledOnTable(metadata: Metadata): Boolean
Determines whether the provided metadata has CDC enabled.
- def isCDCRead(options: CaseInsensitiveStringMap): Boolean
Indicates, based on the read options passed, whether the read is a CDC read.
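The two checks above can be modeled as simple flag lookups. The sketch below uses plain maps in place of Metadata and CaseInsensitiveStringMap. The key names "delta.enableChangeDataFeed" and "readChangeFeed" are assumed here to be the relevant table property and read option, and the parsing rules (case-insensitive option keys, the literal value "true") are simplifications.

```scala
object CdcFlagsSketch {
  // Assumed key names; treat both as assumptions of this sketch.
  val EnableCdfProperty = "delta.enableChangeDataFeed"
  val ReadChangeFeedOption = "readChangeFeed"

  private def isTrue(value: String): Boolean = value.trim.equalsIgnoreCase("true")

  // Table-level check: is the CDC flag set to true in the table properties?
  def isCDCEnabledOnTable(tableProperties: Map[String, String]): Boolean =
    tableProperties.get(EnableCdfProperty).exists(isTrue)

  // Read-level check: option keys are matched case-insensitively.
  def isCDCRead(options: Map[String, String]): Boolean =
    options.exists { case (k, v) => k.equalsIgnoreCase(ReadChangeFeedOption) && isTrue(v) }
}
```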
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or report detailed, operation specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
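The general shape of a method that reports the duration and outcome of a by-name thunk can be sketched as follows. This is a generic illustration of the pattern, not the DeltaLogging implementation: OperationReport and the report callback are hypothetical, and the real method takes a DeltaLog and tags rather than a callback.

```scala
import scala.util.Try

object RecordOperationSketch {
  // Hypothetical report payload: what was run, how long it took, and whether it succeeded.
  final case class OperationReport(opType: String, durationNanos: Long, succeeded: Boolean)

  // Runs `thunk`, reports its duration and outcome, then returns its result
  // or rethrows its exception. Fatal errors are not caught by Try and skip the report.
  def recordOperation[A](opType: String)(report: OperationReport => Unit)(thunk: => A): A = {
    val start = System.nanoTime()
    val result = Try(thunk)
    report(OperationReport(opType, System.nanoTime() - start, result.isSuccess))
    result.get
  }
}
```

Taking the body as a by-name parameter means the caller wraps existing code in a block without changing its return type.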
- def recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordFrameProfile[T](group: String, name: String)(thunk: => T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = null, silent: Boolean = true)(thunk: => S): S
- Definition Classes
- DatabricksLogging
- def recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- def recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def withDmqTag[T](thunk: => T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: => T): T
Report a log to indicate some command is running.
- Definition Classes
- DeltaProgressReporter