package dataobject
Type Members
-
case class
AccessTableDataObject(id: DataObjectId, path: String, schemaMin: Option[StructType] = None, table: Table, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TableDataObject with Product with Serializable
DataObject of type JDBC / Access. Provides access to an Access DB to an Action. The functionality is handled separately from JdbcTableDataObject to avoid problems with net.ucanaccess.jdbc.UcanaccessDriver.
- Annotations
- @Scaladoc()
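A minimal HOCON configuration sketch for this data object, following the config conventions shown in the examples further below; the id, path, database and table names are hypothetical placeholders:
    dataObjects = {
      my-access-table {
        type = AccessTableDataObject
        path = "path/to/database.mdb"
        table = {
          db = "mydb"
          name = "mytable"
        }
      }
    }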
-
case class
ActionsExporterDataObject(id: DataObjectId, config: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[ActionsExporterDataObject] with Product with Serializable
Exports a util DataFrame that contains properties and metadata extracted from all io.smartdatalake.workflow.action.Actions that are registered in the current InstanceRegistry.
Alternatively, it can export the properties and metadata of all io.smartdatalake.workflow.action.Actions defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
    dataObjects = {
      ...
      actions-exporter {
        type = ActionsExporterDataObject
        config = path/to/myconfiguration.conf
      }
      ...
    }
The config value can point to a configuration file or a directory containing configuration files.
- Annotations
- @Scaladoc()
- See also
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
- case class AirbyteConnectorException(msg: String, cause: Throwable = null) extends Exception with Product with Serializable
-
case class
AirbyteDataObject(id: DataObjectId, config: Config, streamName: String, cmd: ParsableScriptDef, incrementalCursorFields: Seq[String] = Seq(), schemaMin: Option[StructType] = None, metadata: Option[DataObjectMetadata] = None) extends DataObject with CanCreateDataFrame with CanCreateIncrementalOutput with SchemaValidation with SmartDataLakeLogger with Product with Serializable
Limitations: connectors only have access to locally mounted directories.
- id
DataObject identifier
- config
Configuration for the source
- streamName
The stream name to read. Must match an entry of the catalog of the source.
- cmd
command to launch airbyte connector. Normally this is of type DockerRunScript.
- incrementalCursorFields
Some sources need a specification of the cursor field for incremental mode
- Annotations
- @Scaladoc()
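A hypothetical configuration sketch; the stream name, cursor field and Docker image are placeholders, and the exact fields of DockerRunScript are an assumption to be checked against its own documentation:
    dataObjects = {
      my-airbyte-source {
        type = AirbyteDataObject
        config = {
          # connector-specific source configuration goes here (placeholder)
        }
        streamName = "customers"
        cmd = {
          type = DockerRunScript
          # the image key and value are assumptions for illustration only
          image = "airbyte/source-example"
        }
        incrementalCursorFields = [updated_at]
      }
    }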
-
case class
AvroFileDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), avroOptions: Option[Map[String, String]] = None, schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObjectWithEmbeddedSchema with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A io.smartdatalake.workflow.dataobject.DataObject backed by an Avro data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Avro formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the databricks spark-avro project.
- avroOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional schema for the spark data frame to be validated on read and write. Note: Existing Avro files contain a source schema. Therefore, this schema is ignored when reading from existing Avro files. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
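A hypothetical configuration sketch; the id, path and partition column are placeholders, and `compression` is a standard Spark Avro option passed through avroOptions:
    dataObjects = {
      my-avro-data {
        type = AvroFileDataObject
        path = "path/to/avro/files"
        partitions = [dt]
        avroOptions = {
          compression = snappy
        }
      }
    }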
-
trait
CanHandlePartitions extends AnyRef
A trait to be implemented by DataObjects which store partitioned data.
- Annotations
- @Scaladoc() @DeveloperApi()
- case class ConnectionTestException(msg: String, ex: Throwable) extends RuntimeException with Product with Serializable
-
case class
CsvFileDataObject(id: DataObjectId, path: String, csvOptions: Map[String, String] = Map(), partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, dateColumnType: DateColumnType = DateColumnType.Date, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A DataObject backed by a comma-separated value (CSV) data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on CSV formatted files.
CSV reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
Read Schema specifications:
- If a data object schema is not defined via the schema attribute (default) and the inferSchema option is disabled (default) in csvOptions, then all column types are set to String and the first row of the CSV file is read to determine the column names and the number of fields.
- If the header option is disabled (default) in csvOptions, then the header is defined as "_c#" for each column, where "#" is the column index. Otherwise the first row of the CSV file is not included in the DataFrame content and its entries are used as the column names for the schema.
- If a data object schema is not defined via the schema attribute and inferSchema is enabled in csvOptions, then the samplingRatio option (default: 1.0) in csvOptions is used to extract a sample from the CSV file in order to determine the input schema automatically.
- csvOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional data object schema. If defined, any automatic schema inference is avoided. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- dateColumnType
Specifies the string format used for writing date typed data.
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
- Note
This data object sets the following default values for csvOptions: delimiter = "|", quote = null, header = false, and inferSchema = false. All other csvOptions default to the values defined by Apache Spark.
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
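A hypothetical configuration sketch that overrides the defaults listed in the note above; the id and path are placeholders:
    dataObjects = {
      my-csv-data {
        type = CsvFileDataObject
        path = "path/to/csv/files"
        csvOptions = {
          delimiter = ","
          header = true
          inferSchema = true
        }
      }
    }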
-
case class
CustomDfDataObject(id: DataObjectId, creator: CustomDfCreatorConfig, schemaMin: Option[StructType] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with SchemaValidation with Product with Serializable
Generic DataObject containing a config object. E.g. used to implement a CustomAction that reads a Webservice.
- Annotations
- @Scaladoc()
- case class CustomFileDataObject(id: DataObjectId, creator: CustomFileCreatorConfig, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with FileRefDataObject with CanCreateInputStream with Product with Serializable
-
trait
DataObject extends SdlConfigObject with ParsableFromConfig[DataObject] with SmartDataLakeLogger with AtlasExportable
This is the root trait for every DataObject.
- Annotations
- @Scaladoc() @DeveloperApi()
-
case class
DataObjectMetadata(name: Option[String] = None, description: Option[String] = None, layer: Option[String] = None, subjectArea: Option[String] = None, tags: Seq[String] = Seq()) extends Product with Serializable
Additional metadata for a DataObject
- name
Readable name of the DataObject
- description
Description of the content of the DataObject
- layer
Name of the layer this DataObject belongs to
- subjectArea
Name of the subject area this DataObject belongs to
- tags
Optional custom tags for this object
- Annotations
- @Scaladoc()
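A hypothetical sketch of a metadata block inside a data object definition; all ids and values are placeholders:
    my-data-object {
      type = CsvFileDataObject
      path = "path/to/files"
      metadata = {
        name = "Customer master data"
        description = "Cleansed customer records"
        layer = "integration"
        subjectArea = "customer"
        tags = [pii, daily]
      }
    }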
-
case class
DataObjectsExporterDataObject(id: DataObjectId, config: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[DataObjectsExporterDataObject] with Product with Serializable
Exports a util DataFrame that contains properties and metadata extracted from all DataObjects that are registered in the current InstanceRegistry.
Alternatively, it can export the properties and metadata of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
    dataObjects = {
      ...
      dataobject-exporter {
        type = DataObjectsExporterDataObject
        config = path/to/myconfiguration.conf
      }
      ...
    }
The config value can point to a configuration file or a directory containing configuration files.
- Annotations
- @Scaladoc()
- See also
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
-
case class
ExcelFileDataObject(id: DataObjectId, path: String, excelOptions: ExcelOptions = ExcelOptions(), partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = ..., acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A DataObject backed by a Microsoft Excel data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Microsoft Excel (.xlsx) formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementation is provided by the Crealytics spark-excel project.
Read Schema:
When useHeader is set to true (default), the reader will use the first row of the Excel sheet as column names for the schema and not include the first row as data values. Otherwise the column names are taken from the schema. If the schema is not provided or inferred, then each column name is defined as "_c#" where "#" is the column index.
When a data object schema is provided, it is used as the schema for the DataFrame. Otherwise, if inferSchema is enabled (default), then the data types of the columns are inferred based on the first excerptSize rows (excluding the first). When no schema is provided and inferSchema is disabled, all columns are assumed to be of string type.
- excelOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional data object schema. If defined, any automatic schema inference is avoided. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop. Default is numberOfTasksPerPartition = 1.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
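A hypothetical configuration sketch; the excelOptions entries correspond to the ExcelOptions case class documented below, and the id, path and sheet name are placeholders:
    dataObjects = {
      my-excel-data {
        type = ExcelFileDataObject
        path = "path/to/excel/files"
        excelOptions = {
          sheetName = "Sheet1"
          useHeader = true
          numLinesToSkip = 1
        }
      }
    }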
-
case class
ExcelOptions(sheetName: Option[String] = None, numLinesToSkip: Option[Int] = None, startColumn: Option[String] = None, endColumn: Option[String] = None, rowLimit: Option[Int] = None, useHeader: Boolean = true, treatEmptyValuesAsNulls: Option[Boolean] = Some(true), inferSchema: Option[Boolean] = Some(true), timestampFormat: Option[String] = Some("dd-MM-yyyy HH:mm:ss"), dateFormat: Option[String] = None, maxRowsInMemory: Option[Int] = None, excerptSize: Option[Int] = None) extends Product with Serializable
Options passed to org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter for reading and writing Microsoft Excel files. Excel support is provided by the spark-excel project (see link below).
- sheetName
Optional name of the Excel Sheet to read from/write to.
- numLinesToSkip
Optional number of rows in the Excel spreadsheet to skip before any data is read. This option must not be set for writing.
- startColumn
Optional first column in the specified Excel Sheet to read from (as string, e.g. B). This option must not be set for writing.
- endColumn
Optional last column in the specified Excel Sheet to read from (as string, e.g. F).
- rowLimit
Optional limit of the number of rows being returned on read. This is applied after numLinesToSkip.
- useHeader
If true, the first row of the Excel sheet specifies the column names (default: true).
- treatEmptyValuesAsNulls
Empty cells are parsed as null values (default: true).
- inferSchema
Infer the schema of the excel sheet automatically (default: true).
- timestampFormat
A format string specifying the format to use when writing timestamps (default: dd-MM-yyyy HH:mm:ss).
- dateFormat
A format string specifying the format to use when writing dates.
- maxRowsInMemory
The number of rows that are stored in memory. If set, a streaming reader is used which can help with big files.
- excerptSize
Sample size for schema inference.
- Annotations
- @Scaladoc()
- See also
The Crealytics spark-excel project: https://github.com/crealytics/spark-excel
-
case class
ForeignKey(db: Option[String], table: String, columns: Map[String, String], name: Option[String]) extends Product with Serializable
Foreign key definition
- db
target database, if not defined it is assumed to be the same as the table owning the foreign key
- table
referenced target table name
- columns
mapping of source column(s) to referenced target table column(s)
- name
optional name for the foreign key, e.g. to depict its role
- Annotations
- @Scaladoc()
- trait HasHadoopStandardFilestore extends CanHandlePartitions
-
case class
HiveTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), analyzeTableAfterWrite: Boolean = false, dateColumnType: DateColumnType = DateColumnType.Date, schemaMin: Option[StructType] = None, table: Table, numInitialHdfsPartitions: Int = 16, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TableDataObject with CanWriteDataFrame with CanHandlePartitions with HasHadoopStandardFilestore with SmartDataLakeLogger with Product with Serializable
DataObject of type Hive. Provides details to access Hive tables to an Action.
- id
unique name of this data object
- path
hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied. If the DataObject is only used for reading, or if the Hive table already exists, the path can be omitted. If the Hive table already exists but with a different path, a warning is issued.
- partitions
partition columns for this data object
- analyzeTableAfterWrite
enable compute statistics after writing data (default=false)
- dateColumnType
type of date column
- schemaMin
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
- table
hive table to be written by this output
- numInitialHdfsPartitions
number of files created when writing into an empty table (otherwise the number will be derived from the existing data)
- saveMode
spark SaveMode to use when writing files, default is "overwrite"
- acl
override the connection's permissions for files created in the table's hadoop directory
- connectionId
optional id of io.smartdatalake.workflow.connection.HiveTableConnection
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- metadata
metadata describing this data object
- Annotations
- @Scaladoc()
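A hypothetical configuration sketch; the id, path, database and table names are placeholders:
    dataObjects = {
      my-hive-table {
        type = HiveTableDataObject
        path = "/data/mydb/mytable"
        partitions = [dt]
        analyzeTableAfterWrite = true
        table = {
          db = "mydb"
          name = "mytable"
        }
      }
    }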
- sealed trait HousekeepingMode extends AnyRef
- case class HttpProxyConfig(host: String, port: Int) extends Product with Serializable
- case class HttpTimeoutConfig(connectionTimeoutMs: Int, readTimeoutMs: Int) extends Product with Serializable
-
case class
JdbcTableDataObject(id: DataObjectId, createSql: Option[String] = None, preReadSql: Option[String] = None, postReadSql: Option[String] = None, preWriteSql: Option[String] = None, postWriteSql: Option[String] = None, schemaMin: Option[StructType] = None, table: Table, jdbcFetchSize: Int = 1000, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, connectionId: ConnectionId, jdbcOptions: Map[String, String] = Map(), virtualPartitions: Seq[String] = Seq(), expectedPartitionsCondition: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanHandlePartitions with CanEvolveSchema with CanMergeDataFrame with Product with Serializable
DataObject of type JDBC. Provides details for an action to access tables in a database through JDBC.
- id
unique name of this data object
- createSql
DDL-statement to be executed in prepare phase, using output jdbc connection. Note that it is also possible to let Spark create the table in Init-phase. See jdbcOptions to customize column data types for auto-created DDL-statement.
- preReadSql
SQL-statement to be executed in exec phase before reading input table, using input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute with values from DefaultExpressionData.
- postReadSql
SQL-statement to be executed in exec phase after reading input table and before action is finished, using input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute with values from DefaultExpressionData.
- preWriteSql
SQL-statement to be executed in exec phase before writing output table, using output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute with values from DefaultExpressionData.
- postWriteSql
SQL-statement to be executed in exec phase after writing output table, using output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute with values from DefaultExpressionData.
- schemaMin
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
- table
The jdbc table to be read
- jdbcFetchSize
Number of rows to be fetched together by the Jdbc driver
- saveMode
SDLSaveMode to use when writing table, default is "Overwrite". Only "Append" and "Overwrite" supported.
- allowSchemaEvolution
If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.
- connectionId
Id of JdbcConnection configuration
- jdbcOptions
Any jdbc options according to https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. Note that some options above set and override some of these options explicitly. Use "createTableOptions" and "createTableColumnTypes" to control automatic creation of database tables.
- virtualPartitions
Virtual partition columns. Note that this doesn't need to be the same as the database partition columns for this table. But it is important that there is an index on these columns to efficiently list existing "partitions".
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- Annotations
- @Scaladoc()
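A hypothetical configuration sketch; the connection id, table names and SQL statement are placeholders, and the %{feed} token assumes that feed is an attribute of DefaultExpressionData:
    dataObjects = {
      my-jdbc-table {
        type = JdbcTableDataObject
        connectionId = my-jdbc-connection
        table = {
          db = "mydb"
          name = "mytable"
        }
        # the token is substituted from DefaultExpressionData before execution
        preWriteSql = "delete from mydb.mytable_staging where feed = '%{feed}'"
        jdbcOptions = {
          createTableColumnTypes = "name varchar(256)"
        }
      }
    }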
-
case class
JsonFileDataObject(id: DataObjectId, path: String, jsonOptions: Option[Map[String, String]] = None, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, stringify: Boolean = false, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A io.smartdatalake.workflow.dataobject.DataObject backed by a JSON data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on JSON formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
- jsonOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional data object schema. If defined, any automatic schema inference is avoided. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- stringify
Set the data type for all values to string.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
- Note
By default, the JSON option multiline is enabled.
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
-
case class
PKViolatorsDataObject(id: DataObjectId, config: Option[String] = None, flattenOutput: Boolean = false, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with ParsableFromConfig[PKViolatorsDataObject] with Product with Serializable
Checks for Primary Key violations for all DataObjects with Primary Keys defined that are registered in the current InstanceRegistry. Returns the list of Primary Key violations as a DataFrame.
Alternatively, it can check for Primary Key violations of all DataObjects defined in config files. For this, the configuration "config" has to be set to the location of the config.
Example:
    dataObjects = {
      ...
      primarykey-violations {
        type = PKViolatorsDataObject
        config = path/to/myconfiguration.conf
      }
      ...
    }
- Annotations
- @Scaladoc()
- See also
Refer to ConfigLoader.loadConfigFromFilesystem() for details about the configuration loading.
-
case class
ParquetFileDataObject(id: DataObjectId, path: String, partitions: Seq[String] = Seq(), parquetOptions: Option[Map[String, String]] = None, schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObjectWithEmbeddedSchema with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A io.smartdatalake.workflow.dataobject.DataObject backed by an Apache Parquet data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on Parquet formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively.
- id
unique name of this data object
- path
Hadoop directory where this data object reads/writes its files. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied. Optionally defined partitions are appended with hadoop standard partition layout to this path. Only files ending with *.parquet* are considered as data for this DataObject.
- partitions
partition columns for this data object
- parquetOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional schema for the spark data frame to be validated on read and write. Note: Existing Parquet files contain a source schema. Therefore, this schema is ignored when reading from existing Parquet files. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- saveMode
spark SaveMode to use when writing files, default is "overwrite"
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- acl
override the connection's permissions for files created with this connection
- connectionId
optional id of io.smartdatalake.workflow.connection.HadoopFileConnection
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- metadata
Metadata describing this data object.
- Annotations
- @Scaladoc()
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
-
case class
PartitionArchiveCompactionMode(archivePartitionExpression: Option[String] = None, compactPartitionExpression: Option[String] = None, description: Option[String] = None) extends HousekeepingMode with SmartDataLakeLogger with Product with Serializable
Archive and compact old partitions: Archive partition reduces the number of partitions in the past by moving older partitions into special "archive partitions". Compact partition reduces the number of files in a partition by rewriting them with Spark.
Example: archive and compact a table with partition layout run_id=<integer>
- archive partitions after 1000 partitions into "archive partition" equal to floor(run_id/1000)
- compact "archive partition" when full
    housekeepingMode = {
      type = PartitionArchiveCompactionMode
      archivePartitionExpression = "if( elements['run_id'] < runId - 1000, map('run_id', elements['run_id'] div 1000), elements)"
      compactPartitionExpression = "elements['run_id'] % 1000 = 0 and elements['run_id'] <= runId - 2000"
    }
- archivePartitionExpression
Expression to define the archive partition for a given partition. Define a spark sql expression working with the attributes of PartitionExpressionData returning archive partition values as Map[String,String]. If return value is the same as input elements, partition is not touched, otherwise all files of the partition are moved to the returned partition definition. Be aware that the value of the partition columns changes for these files/records.
- compactPartitionExpression
Expression to define partitions which should be compacted. Define a spark sql expression working with the attributes of PartitionExpressionData returning a boolean = true when this partition should be compacted. Once a partition is compacted, it is marked as compacted and will not be compacted again. It is therefore ok to return true for all partitions which should be compacted, regardless if they have been compacted already.
- Annotations
- @Scaladoc()
- case class PartitionExpressionData(feed: String, application: String, runId: Int, runStartTime: Timestamp, dataObjectId: String, elements: Map[String, String]) extends Product with Serializable
-
case class
PartitionRetentionMode(retentionCondition: String, description: Option[String] = None) extends HousekeepingMode with SmartDataLakeLogger with Product with Serializable
Keep partitions while the retention condition is fulfilled, delete other partitions.
Example: cleanup partitions with partition layout dt=<yyyymmdd> after 90 days:
    housekeepingMode = {
      type = PartitionRetentionMode
      retentionCondition = "datediff(now(), to_date(elements['dt'], 'yyyyMMdd')) <= 90"
    }
- retentionCondition
Condition to decide if a partition should be kept. Define a spark sql expression working with the attributes of PartitionExpressionData returning a boolean with value true if the partition should be kept.
- Annotations
- @Scaladoc()
-
case class
RawFileDataObject(id: DataObjectId, path: String, customFormat: Option[String] = None, options: Map[String, String] = Map(), fileName: String = "*", partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
DataObject of type raw for files with unknown content. Provides details to an Action to access raw files. By specifying customFormat you can read/write this DataObject with custom Spark data formats.
- customFormat
Custom Spark data source format, e.g. binaryFile or text. Only needed if you want to read/write this DataObject with Spark.
- options
Options for custom Spark data source format. Only of use if you want to read/write this DataObject with Spark.
- fileName
Definition of fileName. This is concatenated with path and partition layout to search for files. Default is an asterisk to match everything.
- saveMode
Overwrite or Append new data.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
-
case class
RelaxedCsvFileDataObject(id: DataObjectId, path: String, csvOptions: Map[String, String] = Map(), partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, dateColumnType: DateColumnType = DateColumnType.Date, treatMissingColumnsAsCorrupt: Boolean = false, treatSuperfluousColumnsAsCorrupt: Boolean = false, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with Product with Serializable
A DataObject which allows for more flexible CSV parsing. The standard CsvFileDataObject doesn't support reading multiple CSV files with different column order, missing columns or additional columns. RelaxedCsvFileDataObject works more like reading JSON files: you need to define a schema, and it then tries to read every file with that schema independently of the column order, adding missing columns and removing superfluous ones.
CSV files are read by Spark as whole text files and then parsed manually with Spark's CSV parser class. You can therefore use the normal CSV options of Spark, but some properties are fixed, e.g. header=true, inferSchema=false, enforceSchema (ignored).
- csvOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
The data object schema.
- dateColumnType
Specifies the string format used for writing date typed data.
- treatMissingColumnsAsCorrupt
If set to true, records from files with missing columns in their header are treated as corrupt (default=false). Corrupt records are handled according to options.mode (default=permissive).
- treatSuperfluousColumnsAsCorrupt
If set to true, records from files with superfluous columns in their header are treated as corrupt (default=false). Corrupt records are handled according to options.mode (default=permissive).
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- Annotations
- @Scaladoc()
- Note
This data object sets the following default values for csvOptions: delimiter = ",", quote = null. All other csvOptions default to the values defined by Apache Spark.
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
If mode is permissive, you can retrieve the corrupt input record by adding <options.columnNameOfCorruptRecord> as a field to the schema. RelaxedCsvFileDataObject also supports getting an error message by adding "<options.columnNameOfCorruptRecord>_msg" as a field to the schema.
-
class
RelaxedParser extends SmartDataLakeLogger
Relaxed parser which reads CSV-lines with fileSchema and returns Spark Rows with tgtSchema
- Annotations
- @Scaladoc()
-
case class
SFtpFileRefDataObject(id: DataObjectId, path: String, connectionId: ConnectionId, partitions: Seq[String] = Seq(), partitionLayout: Option[String] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, expectedPartitionsCondition: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileRefDataObject with CanCreateInputStream with CanCreateOutputStream with SmartDataLakeLogger with Product with Serializable
Connects to SFTP files. Needs the Java library "com.hieronymus % sshj % 0.21.1".
The following authentication mechanisms are supported:
- public/private-key: the private key must be saved in ~/.ssh and the public key must be registered on the server.
- user/pwd authentication: user and password are taken from two variables set as parameters. These variables can come from clear text (CLEAR), a file (FILE) or an environment variable (ENV).
- partitionLayout
partition layout defines how partition values can be extracted from the path. Use "%<colname>%" as token to extract the value for a partition column. With "%<colname:regex>%" a regex can be given to limit search. This is especially useful if there is no char to delimit the last token from the rest of the path or also between two tokens.
- saveMode
Overwrite or Append new data.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- Annotations
- @Scaladoc()
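A hypothetical configuration sketch illustrating the partitionLayout tokens described above; the id, connection id, path and partition columns are placeholders:
    dataObjects = {
      my-sftp-files {
        type = SFtpFileRefDataObject
        path = "/incoming/data"
        connectionId = my-sftp-connection
        partitions = [town, year]
        # %town% extracts the town partition value; %year:[0-9]+% limits the search with a regex
        partitionLayout = "%town%/%year:[0-9]+%/"
      }
    }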
-
case class
Table(db: Option[String], name: String, query: Option[String] = None, primaryKey: Option[Seq[String]] = None, foreignKeys: Option[Seq[ForeignKey]] = None, options: Option[Map[String, String]] = None) extends Product with Serializable
Table attributes
- db
optional override of db defined by connection
- name
table name
- query
optional select query
- primaryKey
optional sequence of primary key columns
- foreignKeys
optional sequence of foreign key definitions. This is used as metadata for a data catalog.
- Annotations
- @Scaladoc()
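A hypothetical sketch of a table definition with a primary key and a foreign key; the database, table and column names are placeholders:
    table = {
      db = "mydb"
      name = "orders"
      primaryKey = [order_id]
      foreignKeys = [{
        # db omitted: assumed to be the same as the owning table's db
        table = "customers"
        columns = { customer_id = "id" }
        name = "fk_orders_customers"
      }]
    }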
- case class TickTockHiveTableDataObject(id: DataObjectId, path: Option[String] = None, partitions: Seq[String] = Seq(), analyzeTableAfterWrite: Boolean = false, dateColumnType: DateColumnType = DateColumnType.Date, schemaMin: Option[StructType] = None, table: Table, numInitialHdfsPartitions: Int = 16, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanHandlePartitions with Product with Serializable
-
case class
WebserviceFileDataObject(id: DataObjectId, url: String, additionalHeaders: Map[String, String] = Map(), timeouts: Option[HttpTimeoutConfig] = None, readTimeoutMs: Option[Int] = None, authMode: Option[AuthMode] = None, mimeType: Option[String] = None, writeMethod: WebserviceMethod = WebserviceMethod.Post, proxy: Option[HttpProxyConfig] = None, followRedirects: Boolean = false, partitionDefs: Seq[WebservicePartitionDefinition] = Seq(), partitionLayout: Option[String] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileRefDataObject with CanCreateInputStream with CanCreateOutputStream with SmartDataLakeLogger with Product with Serializable
DataObject to call a webservice and return the response as an InputStream. This is implemented as a FileRefDataObject because the response is treated as file content. FileRefDataObjects support partitioned data. For a WebserviceFileDataObject, partitions are mapped as query parameters to create the query string. All possible query parameter values must be given in the configuration.
- partitionDefs
list of partitions with list of possible values for every entry
- partitionLayout
definition of partitions in query string. Use %<partitionColName>% as placeholder for partition column value in layout.
- Annotations
- @Scaladoc()
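A hypothetical configuration sketch mapping a partition to a query parameter; the url, partition name and values are placeholders:
    dataObjects = {
      my-webservice {
        type = WebserviceFileDataObject
        url = "https://example.com/api/data"
        partitionDefs = [{
          name = "department"
          values = [sales, hr, finance]
        }]
        # %department% is replaced by the partition column value
        partitionLayout = "?department=%department%"
      }
    }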
- case class WebservicePartitionDefinition(name: String, values: Seq[String]) extends Product with Serializable
-
case class
XmlFileDataObject(id: DataObjectId, path: String, rowTag: Option[String] = None, xmlOptions: Option[Map[String, String]] = None, partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, flatten: Boolean = false, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
A io.smartdatalake.workflow.dataobject.DataObject backed by an XML data source.
It manages read and write access and configurations required for io.smartdatalake.workflow.action.Actions to work on XML formatted files.
Reading and writing details are delegated to Apache Spark org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter respectively. The reader and writer implementations are provided by the databricks spark-xml project. Note that writing partitioned XML files is not supported by spark-xml.
- xmlOptions
Settings for the underlying org.apache.spark.sql.DataFrameReader and org.apache.spark.sql.DataFrameWriter.
- schema
An optional data object schema. If defined, any automatic schema inference is avoided. As this corresponds to the schema on write, it must not include the optional filenameColumn on read.
- sparkRepartition
Optional definition of repartition operation before writing DataFrame with Spark to Hadoop.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
- Annotations
- @Scaladoc()
- See also
org.apache.spark.sql.DataFrameReader
org.apache.spark.sql.DataFrameWriter
-
class
ZipCsvCodec extends ZipCodec
Codec to read and write zipped CSV files with Hadoop. Note that only the first file entry of a Zip archive is read, and only Zip files with one entry named "data.csv" can be created.
Attention: custom codecs in Spark are only implemented for writing files, not for reading them.
Usage in Csv/RelaxedCsvFileDataObject:
    csv-options {
      compression = io.smartdatalake.workflow.dataobject.ZipCsvCodec
    }
- Annotations
- @Scaladoc()
Value Members
- object AccessTableDataObject extends FromConfigFactory[DataObject] with Serializable
- object ActionsExporterDataObject extends FromConfigFactory[ActionsExporterDataObject] with Serializable
- object AirbyteDataObject extends FromConfigFactory[DataObject] with Serializable
- object AvroFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object CsvFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object CustomDfDataObject extends FromConfigFactory[DataObject] with Serializable
- object CustomFileDataObject extends FromConfigFactory[DataObject] with SmartDataLakeLogger with Serializable
- object DataObjectsExporterDataObject extends FromConfigFactory[DataObjectsExporterDataObject] with Serializable
- object ExcelFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object HiveTableDataObject extends FromConfigFactory[DataObject] with Serializable
- object JdbcTableDataObject extends FromConfigFactory[DataObject] with Serializable
- object JsonFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object JsonValidator
- object PKViolatorsDataObject extends FromConfigFactory[PKViolatorsDataObject] with Serializable
- object ParquetFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object RawFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object RelaxedCsvFileDataObject extends FromConfigFactory[DataObject] with Serializable
- object SFtpFileRefDataObject extends FromConfigFactory[DataObject] with Serializable
- object TickTockHiveTableDataObject extends FromConfigFactory[DataObject] with Serializable
- object WebserviceFileDataObject extends FromConfigFactory[DataObject] with SmartDataLakeLogger with Serializable
- object XmlFileDataObject extends FromConfigFactory[DataObject] with Serializable