case class DeduplicateAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, transformers: Seq[ParsableDfTransformer] = Seq(), columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, additionalColumns: Option[Map[String, String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, updateCapturedColumnOnlyWhenChanged: Boolean = false, mergeModeEnable: Boolean = false, mergeModeAdditionalJoinPredicate: Option[String] = None, breakDataFrameLineage: Boolean = false, persist: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkOneToOneActionImpl with Product with Serializable
Action to deduplicate a subfeed. Deduplication keeps the last record for every key, even after it has been deleted in the source. DeduplicateAction adds an additional column TechnicalTableColumn.captured, which contains the timestamp of the last occurrence of the record in the source. This creates many updates; especially when using saveMode.Merge it is better to set TechnicalTableColumn.captured to the last change of the record in the source. Set updateCapturedColumnOnlyWhenChanged = true to enable this optimization.
DeduplicateAction needs a transactional table (e.g. TransactionalSparkTableDataObject) with defined primary keys as output. If the output implements CanMergeDataFrame, saveMode.Merge can be enabled by setting mergeModeEnable = true, which allows for much better performance.
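A minimal sketch of this deduplication semantics on plain Scala collections; the `Record` model and the `deduplicate` function are illustrative only, since DeduplicateAction itself operates on Spark DataFrames:

```scala
// Hypothetical simplified record: key, payload value, and the captured timestamp.
case class Record(key: String, value: String, captured: Long)

// Keep the latest record per key, merging new data into existing data.
def deduplicate(existing: Seq[Record], incoming: Seq[Record]): Seq[Record] = {
  val existingByKey = existing.map(r => r.key -> r).toMap
  // Mimic updateCapturedColumnOnlyWhenChanged = true: keep the old captured
  // timestamp when the record's content is unchanged in the source.
  val updated = incoming.map { in =>
    existingByKey.get(in.key) match {
      case Some(old) if old.value == in.value => old // unchanged: keep old captured
      case _                                  => in  // new or changed: take new captured
    }
  }
  // Records deleted in the source are kept with their last captured timestamp.
  val updatedKeys = updated.map(_.key).toSet
  updated ++ existing.filterNot(r => updatedKeys.contains(r.key))
}
```

With mergeModeEnable = true, only the changed and new records need to be written, which is why fewer updates (via updateCapturedColumnOnlyWhenChanged) translate into better merge performance.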
- inputId
input DataObject
- outputId
output DataObject
- transformer
optional custom transformation to apply
- transformers
optional list of transformations to apply before deduplication. See sparktransformer for a list of included transformers. The transformations are applied in the order of the list.
- columnBlacklist
Remove all columns on the blacklist from the DataFrame
- columnWhitelist
Keep only columns on the whitelist in the DataFrame
- additionalColumns
optional tuples of [column name, Spark SQL expression] to be added as additional columns to the DataFrame. The Spark SQL expressions are evaluated against an instance of io.smartdatalake.util.misc.DefaultExpressionData.
- ignoreOldDeletedColumns
if true, columns that no longer exist are removed during schema evolution
- ignoreOldDeletedNestedColumns
if true, columns that no longer exist are removed from nested data types during schema evolution. Keeping deleted columns in complex data types has a performance impact, as all future data has to be converted by a complex function.
- updateCapturedColumnOnlyWhenChanged
Set to true to update column TechnicalTableColumn.captured only when the record has changed in the source, instead of updating it with every execution (default = false). With saveMode.Merge this results in far fewer updated records.
- mergeModeEnable
Set to true to use saveMode.Merge for much better performance. The output DataObject must implement CanMergeDataFrame if enabled (default = false).
- mergeModeAdditionalJoinPredicate
To optimize performance it can be sufficient to limit the records read from the existing table data, e.g. to the last 7 days. Specify a Spark SQL expression that selects the existing data to be used in the transformation. Use the table alias 'existing' to reference columns of the existing table data.
- executionMode
optional execution mode for this Action
- executionCondition
optional Spark SQL expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
- metricsFailCondition
optional Spark SQL expression evaluated as a where-clause against a DataFrame of metrics. Available columns are dataObjectId, key and value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
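For orientation, the parameters above map directly to keys in a Smart Data Lake Builder HOCON configuration. The following fragment is a hedged sketch: the action name, data object ids and the join predicate are hypothetical, and only a subset of the parameters is shown:

```hocon
actions {
  dedup-customers {
    type = DeduplicateAction
    inputId = stg-customers
    outputId = btl-customers
    mergeModeEnable = true
    updateCapturedColumnOnlyWhenChanged = true
    # illustrative predicate: limit existing data read for the merge to the
    # last 7 days, using the table alias 'existing' documented above
    mergeModeAdditionalJoinPredicate = "existing.captured > current_date - 7"
  }
}
```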
- Linear Supertypes
- DeduplicateAction
- Serializable
- Serializable
- Product
- Equals
- SparkOneToOneActionImpl
- SparkActionImpl
- ActionSubFeedsImpl
- Action
- AtlasExportable
- SmartDataLakeLogger
- DAGNode
- ParsableFromConfig
- SdlConfigObject
- AnyRef
- Any
Instance Constructors
-
new
DeduplicateAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, transformers: Seq[ParsableDfTransformer] = Seq(), columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, additionalColumns: Option[Map[String, String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, updateCapturedColumnOnlyWhenChanged: Boolean = false, mergeModeEnable: Boolean = false, mergeModeAdditionalJoinPredicate: Option[String] = None, breakDataFrameLineage: Boolean = false, persist: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry)
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
addRuntimeEvent(executionId: ExecutionId, phase: ExecutionPhase, state: RuntimeEventState, msg: Option[String] = None, results: Seq[SubFeed] = Seq(), tstmp: LocalDateTime = LocalDateTime.now): Unit
Adds a runtime event for this Action
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
addRuntimeMetrics(executionId: Option[ExecutionId], dataObjectId: Option[DataObjectId], metric: ActionMetrics): Unit
Adds a runtime metric for this Action
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
applyExecutionMode(mainInput: DataObject, mainOutput: DataObject, subFeed: SubFeed, partitionValuesTransform: (Seq[PartitionValues]) ⇒ Map[PartitionValues, PartitionValues])(implicit context: ActionPipelineContext): Unit
Applies the executionMode and stores result in executionModeResult variable
- Attributes
- protected
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
applyTransformers(transformers: Seq[DfTransformer], inputSubFeed: SparkSubFeed, outputSubFeed: SparkSubFeed)(implicit context: ActionPipelineContext): SparkSubFeed
apply transformer to SubFeed
- Attributes
- protected
- Definition Classes
- SparkOneToOneActionImpl
- Annotations
- @Scaladoc()
-
def
applyTransformers(transformers: Seq[PartitionValueTransformer], partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Map[PartitionValues, PartitionValues]
apply transformer to partition values
- Attributes
- protected
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
def
applyTransformers(transformers: Seq[DfsTransformer], inputPartitionValues: Seq[PartitionValues], inputSubFeeds: Seq[SparkSubFeed], outputSubFeeds: Seq[SparkSubFeed])(implicit context: ActionPipelineContext): Seq[SparkSubFeed]
apply transformer to SubFeeds
- Attributes
- protected
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
atlasName: String
- Definition Classes
- Action → AtlasExportable
-
def
atlasQualifiedName(prefix: String): String
- Definition Classes
- AtlasExportable
-
val
breakDataFrameLineage: Boolean
Stop propagating input DataFrame through action and instead get a new DataFrame from DataObject. This can help to save memory and performance if the input DataFrame includes many transformations from previous Actions. The new DataFrame will be initialized according to the SubFeed's partitionValues.
- Definition Classes
- DeduplicateAction → SparkActionImpl
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native() @HotSpotIntrinsicCandidate()
-
def
createEmptyDataFrame(dataObject: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed)(implicit context: ActionPipelineContext): DataFrame
- Definition Classes
- SparkActionImpl
-
def
enrichSubFeedDataFrame(input: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed, phase: ExecutionPhase, isRecursive: Boolean = false)(implicit context: ActionPipelineContext): SparkSubFeed
Enriches SparkSubFeed with DataFrame if not existing
- input
input data object.
- subFeed
input SubFeed.
- phase
current execution phase
- isRecursive
true if this input is a recursive input
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
exec(subFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): Seq[SubFeed]
Executes the main task of an action. In this step the data of the SubFeeds is moved from input to output DataObjects.
- subFeeds
SparkSubFeed's to be processed
- returns
processed SparkSubFeed's
- Definition Classes
- ActionSubFeedsImpl → Action
-
val
executionCondition: Option[Condition]
execution condition for this action.
- Definition Classes
- DeduplicateAction → Action
-
val
executionConditionResult: Option[(Boolean, Option[String])]
- Attributes
- protected
- Definition Classes
- Action
-
val
executionMode: Option[ExecutionMode]
execution mode for this action.
- Definition Classes
- DeduplicateAction → Action
-
val
executionModeResult: Option[Try[Option[ExecutionModeResult]]]
- Attributes
- protected
- Definition Classes
- Action
-
def
factory: FromConfigFactory[Action]
Returns the factory that can parse this type (that is, type CO). Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
- returns
the factory (object) for this class.
- Definition Classes
- DeduplicateAction → ParsableFromConfig
-
def
filterDataFrame(df: DataFrame, partitionValues: Seq[PartitionValues], genericFilter: Option[Column]): DataFrame
Filter DataFrame with given partition values
- df
DataFrame to filter
- partitionValues
partition values to use as filter condition
- genericFilter
filter expression to apply
- returns
filtered DataFrame
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
def
getDataObjectsState: Seq[DataObjectState]
Get potential state of input DataObjects when executionMode is DataObjectStateIncrementalMode.
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
getInputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T
- Attributes
- protected
- Definition Classes
- Action
-
def
getLatestRuntimeEventState: Option[RuntimeEventState]
Get latest runtime state
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
getMainInput(inputSubFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): DataObject
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
def
getMainPartitionValues(inputSubFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): Seq[PartitionValues]
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
def
getOutputDataObject[T <: DataObject](id: DataObjectId)(implicit arg0: ClassTag[T], arg1: scala.reflect.api.JavaUniverse.TypeTag[T], registry: InstanceRegistry): T
- Attributes
- protected
- Definition Classes
- Action
-
def
getRuntimeDataImpl: RuntimeData
- Definition Classes
- SparkActionImpl → Action
-
def
getRuntimeInfo(executionId: Option[ExecutionId] = None): Option[RuntimeInfo]
Get summarized runtime information for a given ExecutionId.
- executionId
ExecutionId to get runtime information for. If empty, runtime information for the last ExecutionId is returned.
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
getRuntimeMetrics(executionId: Option[ExecutionId] = None): Map[DataObjectId, Option[ActionMetrics]]
Get the latest metrics for all DataObjects and a given SDLExecutionId.
- executionId
ExecutionId to get metrics for. If empty, metrics for the last ExecutionId are returned.
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
getTransformers(transformation: Option[CustomDfTransformerConfig], columnBlacklist: Option[Seq[String]], columnWhitelist: Option[Seq[String]], additionalColumns: Option[Map[String, String]], standardizeDatatypes: Boolean, additionalTransformers: Seq[DfTransformer], filterClauseExpr: Option[Column] = None)(implicit context: ActionPipelineContext): Seq[DfTransformer]
Combines all transformations into a list of DfTransformers
- Definition Classes
- SparkOneToOneActionImpl
- Annotations
- @Scaladoc()
-
val
id: ActionId
A unique identifier for this instance.
- Definition Classes
- DeduplicateAction → Action → SdlConfigObject
- val ignoreOldDeletedColumns: Boolean
- val ignoreOldDeletedNestedColumns: Boolean
-
final
def
init(subFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): Seq[SubFeed]
Initialize Action with SubFeeds to be processed. In this step the execution mode is evaluated and the result stored for the exec phase. If successful, the DAG can be built and Spark DataFrame lineage can be created.
- subFeeds
SparkSubFeed's to be processed
- returns
processed SparkSubFeed's
- Definition Classes
- ActionSubFeedsImpl → Action
-
val
input: DataObject with CanCreateDataFrame
Input DataObject which can create DataFrames (CanCreateDataFrame)
- Definition Classes
- DeduplicateAction → SparkOneToOneActionImpl
- val inputId: DataObjectId
-
def
inputIdsToIgnoreFilter: Seq[DataObjectId]
- Definition Classes
- ActionSubFeedsImpl
-
val
inputs: Seq[DataObject with CanCreateDataFrame]
Input DataObjects. To be implemented by subclasses.
- Definition Classes
- DeduplicateAction → SparkActionImpl → Action
-
def
isAsynchronous: Boolean
If this Action should be run as asynchronous streaming process
- Definition Classes
- SparkActionImpl → Action
-
def
isAsynchronousProcessStarted: Boolean
- Definition Classes
- SparkActionImpl → Action
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
logWritingFinished(subFeed: SparkSubFeed, noData: Option[Boolean], duration: Duration)(implicit context: ActionPipelineContext): Unit
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
def
logWritingStarted(subFeed: SparkSubFeed)(implicit context: ActionPipelineContext): Unit
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
lazy val
logger: Logger
- Attributes
- protected
- Definition Classes
- SmartDataLakeLogger
- Annotations
- @transient()
-
def
mainInputId: Option[DataObjectId]
- Definition Classes
- ActionSubFeedsImpl
-
lazy val
mainOutput: DataObject
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
def
mainOutputId: Option[DataObjectId]
- Definition Classes
- ActionSubFeedsImpl
- val mergeModeAdditionalJoinPredicate: Option[String]
- val mergeModeEnable: Boolean
-
val
metadata: Option[ActionMetadata]
Additional metadata for the Action
- Definition Classes
- DeduplicateAction → Action
-
val
metricsFailCondition: Option[String]
Spark SQL condition evaluated as where-clause against dataframe of metrics. Available columns are dataObjectId, key, value. If there are any rows passing the where clause, a MetricCheckFailed exception is thrown.
- Definition Classes
- DeduplicateAction → Action
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
nodeId: String
provide an implementation of the DAG node id
- Definition Classes
- Action → DAGNode
- Annotations
- @Scaladoc()
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
val
output: TransactionalSparkTableDataObject
Output DataObject which can write DataFrames (CanWriteDataFrame)
- Definition Classes
- DeduplicateAction → SparkOneToOneActionImpl
- val outputId: DataObjectId
-
val
outputs: Seq[TransactionalSparkTableDataObject]
Output DataObjects. To be implemented by subclasses.
- Definition Classes
- DeduplicateAction → SparkActionImpl → Action
-
val
persist: Boolean
Force persisting input DataFrames on disk. This improves performance if a DataFrame is used multiple times in the transformation and can serve as a recovery point in case a task gets lost. Note that DataFrames are persisted automatically by the previous Action if later Actions need the same data. To avoid this behaviour set breakDataFrameLineage=false.
- Definition Classes
- DeduplicateAction → SparkActionImpl
-
final
def
postExec(inputSubFeeds: Seq[SubFeed], outputSubFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): Unit
Executes operations needed after executing an action. In this step any task on input or output DataObjects needed after the main task is executed, e.g. JdbcTableDataObject's postWriteSql or CopyAction's deleteInputData.
- Definition Classes
- SparkOneToOneActionImpl → SparkActionImpl → ActionSubFeedsImpl → Action
-
def
postExecFailed(implicit context: ActionPipelineContext): Unit
Executes operations needed to clean up after an action failed.
- Definition Classes
- SparkActionImpl → Action
-
def
postExecSubFeed(inputSubFeed: SubFeed, outputSubFeed: SubFeed)(implicit context: ActionPipelineContext): Unit
Executes operations needed after executing an action for the SubFeed. Can be implemented by subclasses.
- Definition Classes
- SparkOneToOneActionImpl
- Annotations
- @Scaladoc()
-
def
postprocessOutputSubFeedCustomized(subFeed: SparkSubFeed)(implicit context: ActionPipelineContext): SparkSubFeed
Implement additional processing logic for SubFeeds after transformation. Can be implemented by subclass.
- Definition Classes
- SparkActionImpl → ActionSubFeedsImpl
-
def
postprocessOutputSubFeeds(subFeeds: Seq[SparkSubFeed])(implicit context: ActionPipelineContext): Seq[SparkSubFeed]
- Definition Classes
- ActionSubFeedsImpl
-
def
preExec(subFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): Unit
Executes operations needed before executing an action. In this step any phase on input or output DataObjects needed before the main task is executed, e.g. JdbcTableDataObject's preWriteSql.
- Definition Classes
- SparkActionImpl → Action
-
def
preInit(subFeeds: Seq[SubFeed], dataObjectsState: Seq[DataObjectState])(implicit context: ActionPipelineContext): Unit
Checks before initialization of Action. In this step the execution condition is evaluated and Action init is skipped if the result is false.
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
def
prepare(implicit context: ActionPipelineContext): Unit
Prepare DataObject prerequisites. In this step preconditions are prepared and tested: connections can be created, and needed structures exist, e.g. a Kafka topic or a JDBC table.
This runs during the "prepare" phase of the DAG.
- Definition Classes
- ActionSubFeedsImpl → Action
-
def
prepareInputSubFeed(input: DataObject with CanCreateDataFrame, subFeed: SparkSubFeed, ignoreFilters: Boolean = false)(implicit context: ActionPipelineContext): SparkSubFeed
Applies changes to a SubFeed from a previous action in order to be used as input for this action's transformation.
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
def
prepareInputSubFeeds(subFeeds: Seq[SubFeed])(implicit context: ActionPipelineContext): (Seq[SparkSubFeed], Seq[SparkSubFeed])
- Definition Classes
- ActionSubFeedsImpl
-
def
preprocessInputSubFeedCustomized(subFeed: SparkSubFeed, ignoreFilters: Boolean, isRecursive: Boolean)(implicit context: ActionPipelineContext): SparkSubFeed
Implement additional preprocess logic for SubFeeds before transformation. Can be implemented by subclass.
- isRecursive
If subfeed is recursive (input & output)
- Attributes
- protected
- Definition Classes
- SparkActionImpl → ActionSubFeedsImpl
-
lazy val
prioritizedMainInputCandidates: Seq[DataObject]
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
val
recursiveInputs: Seq[TransactionalSparkTableDataObject]
Recursive Inputs are DataObjects that are used as Output and Input in the same action. This is usually prohibited as it creates loops in the DAG. In special cases this makes sense, i.e. when building a complex comparison/update logic.
Usage: add DataObjects used as Output and Input as outputIds and recursiveInputIds, but not as inputIds.
- Definition Classes
- DeduplicateAction → SparkActionImpl → Action
-
def
saveModeOptions: Option[SaveModeOptions]
Override and parametrize saveMode in output DataObject configurations when writing to DataObjects.
- Definition Classes
- DeduplicateAction → SparkActionImpl
-
def
setSparkJobMetadata(operation: Option[String] = None)(implicit context: ActionPipelineContext): Unit
Sets the util job description for better traceability in the Spark UI
Note: This sets Spark local properties, which are propagated to the respective executor tasks. We rely on this to match metrics back to Actions and DataObjects. As writing to a DataObject on the Driver happens uninterrupted in the same exclusive thread, this is suitable.
- operation
phase description (be short...)
- Definition Classes
- Action
- Annotations
- @Scaladoc()
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
final
def
toString(executionId: Option[ExecutionId]): String
- Definition Classes
- Action
-
final
def
toString(): String
This is displayed in ascii graph visualization
- Definition Classes
- Action → AnyRef → Any
- Annotations
- @Scaladoc()
-
def
toStringMedium: String
- Definition Classes
- Action
-
def
toStringShort: String
- Definition Classes
- Action
-
def
transform(inputSubFeed: SparkSubFeed, outputSubFeed: SparkSubFeed)(implicit context: ActionPipelineContext): SparkSubFeed
Transform a SparkSubFeed. To be implemented by subclasses.
- inputSubFeed
SparkSubFeed to be transformed
- outputSubFeed
SparkSubFeed to be enriched with transformed result
- returns
transformed output SparkSubFeed
- Definition Classes
- DeduplicateAction → SparkOneToOneActionImpl
-
final
def
transform(inputSubFeeds: Seq[SparkSubFeed], outputSubFeeds: Seq[SparkSubFeed])(implicit context: ActionPipelineContext): Seq[SparkSubFeed]
Transform subfeed content. To be implemented by subclass.
- Definition Classes
- SparkOneToOneActionImpl → ActionSubFeedsImpl
-
def
transformPartitionValues(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Map[PartitionValues, PartitionValues]
Transform partition values. Can be implemented by subclass.
- Definition Classes
- DeduplicateAction → ActionSubFeedsImpl
- val transformers: Seq[ParsableDfTransformer]
- val updateCapturedColumnOnlyWhenChanged: Boolean
-
def
validateAndUpdateSubFeedCustomized(output: DataObject, subFeed: SparkSubFeed)(implicit context: ActionPipelineContext): SparkSubFeed
The transformed DataFrame is validated to include the output's partition columns; partition columns are moved to the end and the SubFeed's partition values are updated.
- output
output DataObject
- subFeed
SubFeed with transformed DataFrame
- returns
validated and updated SubFeed
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
def
validateConfig(): Unit
Put configuration validation checks here.
- Definition Classes
- ActionSubFeedsImpl → Action
- Annotations
- @Scaladoc()
-
def
validateDataFrameContainsCols(df: DataFrame, columns: Seq[String], debugName: String): Unit
Validate that DataFrame contains a given list of columns, throwing an exception otherwise.
- df
DataFrame to validate
- columns
Columns that must exist in DataFrame
- debugName
name to mention in exception
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
def
validatePartitionValuesExisting(dataObject: DataObject with CanHandlePartitions, subFeed: SubFeed)(implicit context: ActionPipelineContext): Unit
- Attributes
- protected
- Definition Classes
- ActionSubFeedsImpl
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
def
writeOutputSubFeeds(subFeeds: Seq[SparkSubFeed])(implicit context: ActionPipelineContext): Unit
- Definition Classes
- ActionSubFeedsImpl
-
def
writeSubFeed(subFeed: SparkSubFeed, output: DataObject with CanWriteDataFrame, isRecursiveInput: Boolean = false)(implicit context: ActionPipelineContext): Option[Boolean]
Writes subfeed to output respecting the given execution mode.
- returns
true if no data was transferred, otherwise false. None if unknown.
- Definition Classes
- SparkActionImpl
- Annotations
- @Scaladoc()
-
def
writeSubFeed(subFeed: SparkSubFeed, isRecursive: Boolean)(implicit context: ActionPipelineContext): WriteSubFeedResult
Write subfeed data to output. To be implemented by subclass.
- isRecursive
If subfeed is recursive (input & output)
- returns
false if there was no data to process, otherwise true.
- Attributes
- protected
- Definition Classes
- SparkActionImpl → ActionSubFeedsImpl
Deprecated Value Members
-
val
additionalColumns: Option[Map[String, String]]
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.
-
val
columnBlacklist: Option[Seq[String]]
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.
-
val
columnWhitelist: Option[Seq[String]]
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.
-
val
filterClause: Option[String]
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] ) @Deprecated
- Deprecated
-
val
standardizeDatatypes: Boolean
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.
-
val
transformer: Option[CustomDfTransformerConfig]
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.5) Use transformers instead.