abstract class ETL[T, C <: Configuration] extends AnyRef
Defines a common workflow for ETL jobs. By definition, an ETL can take 1..N sources as input and produce 1..N outputs.
- T
Type used to capture data changes in the ETL
- C
Configuration type
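A hypothetical concrete implementation might look like the sketch below. `UserETL`, `MyConf`, the dataset ids, paths, and column names are assumptions; `ETLContext`, `DatasetConf`, `conf`, `spark`, and `toMain` come from the ETL class and its context.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: tracks data changes with a Timestamp (T = Timestamp)
// and a hypothetical Configuration subtype MyConf (C = MyConf).
class UserETL(context: ETLContext[Timestamp, MyConf])
    extends ETL[Timestamp, MyConf](context) {

  // The main output dataset; how the DatasetConf is obtained depends on your Configuration.
  override def mainDestination: DatasetConf = ??? // e.g. looked up from conf

  override def extract(lastRunValue: Timestamp, currentRunValue: Timestamp): Map[String, DataFrame] =
    Map("raw_users" -> spark.read.parquet("/data/raw/users")
      .filter(col("updated_at") > lastRunValue)) // filter only, no joins

  override def transform(data: Map[String, DataFrame],
                         lastRunValue: Timestamp,
                         currentRunValue: Timestamp): Map[String, DataFrame] =
    toMain(data("raw_users").dropDuplicates("user_id")) // keyed under mainDestination
}
```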
- Inheritance: ETL → AnyRef → Any
Instance Constructors
- new ETL(context: ETLContext[T, C])
- context
runtime configuration
Abstract Value Members
- abstract def extract(lastRunValue: T = minValue, currentRunValue: T = defaultCurrentValue): Map[String, DataFrame]
Reads data from a file system and produces a Map[String, DataFrame] keyed by dataset id. This method should avoid transformations and joins, but can apply filters to make the ETL more efficient.
- returns
all the data needed by the transform method to produce the desired output.
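For instance, an incremental extract could filter at read time; the dataset id `raw_events`, the path, and the `ingested_at` column are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

override def extract(lastRunValue: T, currentRunValue: T): Map[String, DataFrame] =
  Map(
    // Read raw files and filter down to the incremental window;
    // no joins or transformations here, per the contract above.
    "raw_events" -> spark.read.parquet("/data/raw/events")
      .filter(col("ingested_at") > lastRunValue && col("ingested_at") <= currentRunValue)
  )
```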
- abstract def mainDestination: DatasetConf
- abstract def transform(data: Map[String, DataFrame], lastRunValue: T = minValue, currentRunValue: T = defaultCurrentValue): Map[String, DataFrame]
Takes a Map[String, DataFrame] as input and applies a set of transformations to it to produce the ETL output. It is recommended not to read any additional data here; use the extract() method to inject input data instead.
- data
input data
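A transform override, under the same assumed `raw_events` id and hypothetical column names, might only reshape what extract() provided:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date}

override def transform(data: Map[String, DataFrame],
                       lastRunValue: T,
                       currentRunValue: T): Map[String, DataFrame] = {
  // Only consume datasets injected by extract(); no extra reads here.
  val events = data("raw_events")
  toMain(events
    .dropDuplicates("event_id")
    .withColumn("event_date", to_date(col("event_time")))) // toMain keys the result under mainDestination
}
```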
Concrete Value Members
- final def !=(arg0: Any): Boolean (AnyRef → Any)
- final def ##(): Int (AnyRef → Any)
- final def ==(arg0: Any): Boolean (AnyRef → Any)
- final def asInstanceOf[T0]: T0 (Any)
- protected[lang] def clone(): AnyRef (AnyRef; @throws( ... ) @native() @HotSpotIntrinsicCandidate())
- implicit val conf: Configuration
- val defaultCurrentValue: T
- def defaultRepartition: (DataFrame) ⇒ DataFrame
- def defaultSampling: PartialFunction[String, (DataFrame) ⇒ DataFrame]
- final def eq(arg0: AnyRef): Boolean (AnyRef)
- def equals(arg0: Any): Boolean (AnyRef → Any)
- final def getClass(): Class[_] (AnyRef → Any; @native() @HotSpotIntrinsicCandidate())
- def getLastRunValue(ds: DatasetConf): T
If possible, fetches the last run value (usually a date or an id) from the dataset passed as argument.
- ds
dataset
- returns
the last run value, or minValue
- def hashCode(): Int (AnyRef → Any; @native() @HotSpotIntrinsicCandidate())
- final def isInstanceOf[T0]: Boolean (Any)
- def load(data: Map[String, DataFrame], lastRunValue: T = minValue, currentRunValue: T = defaultCurrentValue): Map[String, DataFrame]
Loads the output data into persistent storage. The destination can be an object store, a database, flat files, etc.
- data
output data produced by the transform method.
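If the default behaviour ever needs replacing, a load override could write each produced dataset itself; the output paths and the `ingestion_day` partition column below are assumptions:

```scala
import org.apache.spark.sql.DataFrame

override def load(data: Map[String, DataFrame],
                  lastRunValue: T,
                  currentRunValue: T): Map[String, DataFrame] = {
  data.foreach { case (id, df) =>
    // One directory per dataset id; overwrite keeps the run idempotent.
    df.write.mode("overwrite").partitionBy("ingestion_day").parquet(s"/data/output/$id")
  }
  data
}
```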
- def loadDataset(df: DataFrame, ds: DatasetConf): DataFrame
- val log: Logger
- val minValue: T
- final def ne(arg0: AnyRef): Boolean (AnyRef)
- final def notify(): Unit (AnyRef; @native() @HotSpotIntrinsicCandidate())
- final def notifyAll(): Unit (AnyRef; @native() @HotSpotIntrinsicCandidate())
- def publish(): Unit
OPTIONAL - Contains all the actions required to make the data available to users, such as creating a view over the data.
- def replaceWhere: Option[String]
replaceWhere is used for the OverWriteStaticPartition load. It avoids computing the output DataFrame just to infer which partitions to replace. Most of the time these partitions can be inferred statically; always prefer that to overwriting partitions dynamically.
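For example, when each run rewrites a single date partition, the predicate can be built statically; the `ingestion_day` column name is an assumption:

```scala
// Overwrite only today's partition, without scanning the output
// DataFrame to discover which partitions it touches.
override def replaceWhere: Option[String] =
  Some(s"ingestion_day = '${java.time.LocalDate.now()}'")
```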
- def reset(): Unit
Reset the ETL by removing the destination dataset.
- def run(lastRunValue: Option[T] = None, currentRunValue: Option[T] = None): Map[String, DataFrame]
Entry point of the ETL; execute this method to run the whole ETL.
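Typical usage, with a hypothetical concrete subclass (`UserETL`, `context`, `start`, and `end` are assumptions):

```scala
import org.apache.spark.sql.DataFrame

// run() drives the whole extract -> transform -> load workflow.
val etl = new UserETL(context)
val output: Map[String, DataFrame] = etl.run()

// Both parameters are Options, so an explicit run window can be forced:
etl.run(lastRunValue = Some(start), currentRunValue = Some(end))
```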
- def sampling: PartialFunction[String, (DataFrame) ⇒ DataFrame]
Logic used when the ETL is run as a RunStep.sample step.
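An override might sample only the heaviest input and leave the others to defaultSampling; the `raw_events` id and the sampling rate are assumptions:

```scala
import org.apache.spark.sql.DataFrame

override def sampling: PartialFunction[String, DataFrame => DataFrame] = {
  case "raw_events" => df => df.sample(0.01) // keep ~1% of the largest input
}
```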
- implicit val spark: SparkSession
- final def synchronized[T0](arg0: ⇒ T0): T0 (AnyRef)
- def toMain(df: ⇒ DataFrame): Map[String, DataFrame]
- def toString(): String (AnyRef → Any)
- final def wait(arg0: Long, arg1: Int): Unit (AnyRef; @throws( ... ))
- final def wait(arg0: Long): Unit (AnyRef; @throws( ... ) @native())
- final def wait(): Unit (AnyRef; @throws( ... ))
Deprecated Value Members
- protected[lang] def finalize(): Unit (AnyRef; @throws( classOf[java.lang.Throwable] ) @Deprecated)