case class RawFileDataObject(id: DataObjectId, path: String, customFormat: Option[String] = None, options: Map[String, String] = Map(), fileName: String = "*", partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkFileDataObject with CanCreateDataFrame with CanWriteDataFrame with Product with Serializable
DataObject of type raw for files with unknown content. Provides details to an Action to access raw files. By specifying a custom format you can use custom Spark data source formats.
- customFormat
Custom Spark data source format, e.g. binaryFile or text. Only needed if you want to read/write this DataObject with Spark.
- options
Options for custom Spark data source format. Only of use if you want to read/write this DataObject with Spark.
- fileName
Definition of fileName. This is concatenated with path and partition layout to search for files. Default is an asterisk to match everything.
- saveMode
Overwrite or Append new data.
- expectedPartitionsCondition
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
- housekeepingMode
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to cleanup, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
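In practice this DataObject is usually declared in the SDL HOCON configuration rather than instantiated in code. A minimal sketch of such an entry might look as follows; the object name, path and format value are illustrative assumptions, not taken from this page:

```hocon
dataObjects {
  raw-input {
    type = RawFileDataObject
    path = "hdfs:///data/raw/input"
    # customFormat is only needed to read/write this DataObject with Spark
    customFormat = binaryFile
    saveMode = Overwrite
  }
}
```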
- Annotations
- @Scaladoc()
- By Inheritance
- RawFileDataObject
- Serializable
- Serializable
- Product
- Equals
- SparkFileDataObject
- SchemaValidation
- UserDefinedSchema
- CanCreateStreamingDataFrame
- CanWriteDataFrame
- CanCreateDataFrame
- HadoopFileDataObject
- HasHadoopStandardFilestore
- CanCreateOutputStream
- CanCreateInputStream
- FileRefDataObject
- FileDataObject
- CanHandlePartitions
- DataObject
- AtlasExportable
- SmartDataLakeLogger
- ParsableFromConfig
- SdlConfigObject
- AnyRef
- Any
Instance Constructors
-
new
RawFileDataObject(id: DataObjectId, path: String, customFormat: Option[String] = None, options: Map[String, String] = Map(), fileName: String = "*", partitions: Seq[String] = Seq(), schema: Option[StructType] = None, schemaMin: Option[StructType] = None, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, sparkRepartition: Option[SparkRepartitionDef] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, filenameColumn: Option[String] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry)
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
acl: Option[AclDef]
Return the ACL definition for the Hadoop path of this DataObject.
- Definition Classes
- RawFileDataObject → HadoopFileDataObject
- See also
org.apache.hadoop.fs.permission.AclEntry
-
def
addFieldIfNotExisting(writeSchema: StructType, colName: String, dataType: DataType): StructType
- Attributes
- protected
- Definition Classes
- CanCreateDataFrame
-
def
afterRead(df: DataFrame)(implicit context: ActionPipelineContext): DataFrame
Callback that enables a potential transformation to be applied to df after the data is read.
Default is to validate the schemaMin and not apply any modification.
- Definition Classes
- SparkFileDataObject
- Annotations
- @Scaladoc()
-
def
applyAcls(implicit context: ActionPipelineContext): Unit
- Attributes
- protected[workflow]
- Definition Classes
- HadoopFileDataObject
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
atlasName: String
- Definition Classes
- DataObject → AtlasExportable
-
def
atlasQualifiedName(prefix: String): String
- Definition Classes
- AtlasExportable
-
def
beforeWrite(df: DataFrame)(implicit context: ActionPipelineContext): DataFrame
Callback that enables a potential transformation to be applied to df before the data is written.
Default is to validate the schemaMin and not apply any modification.
- Definition Classes
- SparkFileDataObject
- Annotations
- @Scaladoc()
-
def
checkFilesExisting(implicit context: ActionPipelineContext): Boolean
Check if the input files exist.
- Attributes
- protected
- Definition Classes
- HadoopFileDataObject
- Annotations
- @Scaladoc()
- Exceptions thrown
IllegalArgumentException if failIfFilesMissing = true and no files are found at path.
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native() @HotSpotIntrinsicCandidate()
-
def
compactPartitions(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Unit
Compact partitions using Spark.
- Definition Classes
- SparkFileDataObject → CanHandlePartitions
- Annotations
- @Scaladoc()
-
val
connection: Option[HadoopFileConnection]
- Attributes
- protected
- Definition Classes
- HadoopFileDataObject
-
val
connectionId: Option[ConnectionId]
Return the connection id.
Connection defines path prefix (scheme, authority, base path) and ACLs in a central location.
- Definition Classes
- RawFileDataObject → HadoopFileDataObject
-
def
createEmptyPartition(partitionValues: PartitionValues)(implicit context: ActionPipelineContext): Unit
Create an empty partition.
- Definition Classes
- HadoopFileDataObject → CanHandlePartitions
-
def
createInputStream(path: String)(implicit context: ActionPipelineContext): InputStream
- Definition Classes
- HadoopFileDataObject → CanCreateInputStream
-
def
createOutputStream(path: String, overwrite: Boolean)(implicit context: ActionPipelineContext): OutputStream
Create an OutputStream for a given path, that the Action can use to write data into.
- Definition Classes
- HadoopFileDataObject → CanCreateOutputStream
-
def
createReadSchema(writeSchema: StructType)(implicit context: ActionPipelineContext): StructType
Creates the read schema based on a given write schema. Normally this is the same, but some DataObjects can remove & add columns on read (e.g. KafkaTopicDataObject, SparkFileDataObject). In these cases we have to break the DataFrame lineage and create a dummy DataFrame in the init phase.
- Definition Classes
- SparkFileDataObject → CanCreateDataFrame
- val customFormat: Option[String]
-
def
deleteAll(implicit context: ActionPipelineContext): Unit
Delete all data. This is used to implement SaveMode.Overwrite.
- Definition Classes
- HadoopFileDataObject → FileRefDataObject
-
def
deleteAllFiles(path: Path)(implicit context: ActionPipelineContext): Unit
Delete all files inside the given path recursively.
- Definition Classes
- HadoopFileDataObject
- Annotations
- @Scaladoc()
-
def
deleteFileRefs(fileRefs: Seq[FileRef])(implicit context: ActionPipelineContext): Unit
Delete given files. This is used to cleanup files after they are processed.
- Definition Classes
- HadoopFileDataObject → FileRefDataObject
-
def
deletePartitions(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Unit
Delete Hadoop partitions.
If there is no value for a partition column before the last partition column given, the partition path will be exploded.
- Definition Classes
- HadoopFileDataObject → CanHandlePartitions
- Annotations
- @Scaladoc()
-
def
deletePartitionsFiles(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Unit
Delete files inside Hadoop partitions, but keep the partition directories to preserve ACLs.
If there is no value for a partition column before the last partition column given, the partition path will be exploded.
- Definition Classes
- HadoopFileDataObject
- Annotations
- @Scaladoc()
-
def
endWritingOutputStreams(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Unit
This is called after all output streams have been written. It is used e.g. for making sure empty partitions are created as well.
- Definition Classes
- HadoopFileDataObject → CanCreateOutputStream
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
val
expectedPartitionsCondition: Option[String]
Definition of partitions that are expected to exist. This is used to validate that partitions being read exist and don't return no data. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Example: "elements['yourColName'] > 2017"
- returns
true if partition is expected to exist.
- Definition Classes
- RawFileDataObject → CanHandlePartitions
-
def
extractPartitionValuesFromPath(filePath: String)(implicit context: ActionPipelineContext): PartitionValues
Extract partition values from a given file path.
- Attributes
- protected
- Definition Classes
- FileRefDataObject
- Annotations
- @Scaladoc()
-
def
factory: FromConfigFactory[DataObject]
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
- returns
the factory (object) for this class.
- Definition Classes
- RawFileDataObject → ParsableFromConfig
-
def
failIfFilesMissing: Boolean
Configure whether io.smartdatalake.workflow.action.Actions should fail if the input file(s) are missing on the file system.
Default is false.
- Definition Classes
- HadoopFileDataObject
- Annotations
- @Scaladoc()
-
val
fileName: String
Definition of fileName. Default is an asterisk to match everything. This is concatenated with the partition layout to search for files.
- Definition Classes
- RawFileDataObject → FileRefDataObject
-
val
filenameColumn: Option[String]
The name of the (optional) additional column containing the source filename.
- Definition Classes
- RawFileDataObject → SparkFileDataObject
-
def
filterPartitionsExisting(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Seq[PartitionValues]
Filters only existing partitions. Note that partition values to check don't need to have a key/value defined for every partition column.
- Definition Classes
- SparkFileDataObject
- Annotations
- @Scaladoc()
-
def
format: String
The Spark-Format provider to be used.
- Definition Classes
- RawFileDataObject → SparkFileDataObject
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
def
getConcretePaths(pv: PartitionValues)(implicit context: ActionPipelineContext): Seq[Path]
Generate all paths for given partition values, exploding undefined partitions before the last given partition value. Use case: reading all files from a given path with Spark cannot contain wildcards. If there are partitions without a given partition value before the last partition value given, they must be searched with globs.
- Definition Classes
- HadoopFileDataObject
- Annotations
- @Scaladoc()
-
def
getConnection[T <: Connection](connectionId: ConnectionId)(implicit registry: InstanceRegistry, ct: ClassTag[T], tt: scala.reflect.api.JavaUniverse.TypeTag[T]): T
Handle class cast exceptions when getting objects from the instance registry.
- Attributes
- protected
- Definition Classes
- DataObject
- Annotations
- @Scaladoc()
-
def
getConnectionReg[T <: Connection](connectionId: ConnectionId, registry: InstanceRegistry)(implicit ct: ClassTag[T], tt: scala.reflect.api.JavaUniverse.TypeTag[T]): T
- Attributes
- protected
- Definition Classes
- DataObject
-
def
getDataFrame(partitionValues: Seq[PartitionValues] = Seq())(implicit context: ActionPipelineContext): DataFrame
Constructs an Apache Spark DataFrame from the underlying file content.
- returns
a new DataFrame containing the data stored in the file at path
- Definition Classes
- SparkFileDataObject → CanCreateDataFrame
- Annotations
- @Scaladoc()
- See also
DataFrameReader
-
def
getFileRefs(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Seq[FileRef]
List files for given partition values.
- partitionValues
List of partition values to be filtered. If empty, all files in the root path of the DataObject will be listed.
- returns
List of FileRefs
- Definition Classes
- HadoopFileDataObject → FileRefDataObject
-
def
getPartitionString(partitionValues: PartitionValues)(implicit context: ActionPipelineContext): Option[String]
Get partition values formatted by the partition layout.
- Definition Classes
- FileRefDataObject
- Annotations
- @Scaladoc()
-
def
getPath(implicit context: ActionPipelineContext): String
Method for subclasses to override the base path for this DataObject. This is for instance needed if pathPrefix is defined in a connection.
- Definition Classes
- HadoopFileDataObject → FileRefDataObject
-
def
getSchema(sourceExists: Boolean): Option[StructType]
Returns the user-defined schema for reading from the data source. By default, this should return schema, but it may be customized by data objects that have a source schema and ignore the user-defined schema on read operations.
If a user-defined schema is returned, it overrides any schema inference. If no user-defined schema is set, the schema may be inferred depending on the configuration and type of data frame reader.
- sourceExists
Whether the source file/table exists already. Existing sources may have a source schema.
- returns
The schema to use for the data frame reader when reading from the source.
- Definition Classes
- SparkFileDataObject
- Annotations
- @Scaladoc()
-
def
getSearchPaths(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Seq[(PartitionValues, String)]
Prepare paths to be searched.
- Attributes
- protected
- Definition Classes
- FileRefDataObject
- Annotations
- @Scaladoc()
-
def
getStreamingDataFrame(options: Map[String, String], pipelineSchema: Option[StructType])(implicit context: ActionPipelineContext): DataFrame
- Definition Classes
- SparkFileDataObject → CanCreateStreamingDataFrame
-
def
hadoopPath(implicit context: ActionPipelineContext): Path
- Definition Classes
- HadoopFileDataObject → HasHadoopStandardFilestore
-
val
housekeepingMode: Option[HousekeepingMode]
Configure a housekeeping mode to e.g. cleanup, archive and compact partitions. Default is None.
- Definition Classes
- RawFileDataObject → DataObject
-
val
id: DataObjectId
A unique identifier for this instance.
- Definition Classes
- RawFileDataObject → DataObject → SdlConfigObject
-
def
init(df: DataFrame, partitionValues: Seq[PartitionValues], saveModeOptions: Option[SaveModeOptions] = None)(implicit context: ActionPipelineContext): Unit
Called during the init phase for checks and initialization. If possible, don't change the system until the execution phase.
- Definition Classes
- SparkFileDataObject → CanWriteDataFrame
-
implicit
val
instanceRegistry: InstanceRegistry
Return the InstanceRegistry parsed from the SDL configuration used for this run.
- returns
the current InstanceRegistry.
- Definition Classes
- RawFileDataObject → HadoopFileDataObject
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
listPartitions(implicit context: ActionPipelineContext): Seq[PartitionValues]
List partitions on the data object's root path.
- Definition Classes
- HadoopFileDataObject → CanHandlePartitions
- Annotations
- @Scaladoc()
-
lazy val
logger: Logger
- Attributes
- protected
- Definition Classes
- SmartDataLakeLogger
- Annotations
- @transient()
-
val
metadata: Option[DataObjectMetadata]
Additional metadata for the DataObject.
- Definition Classes
- RawFileDataObject → DataObject
-
def
movePartitions(partitionValuesMapping: Seq[(PartitionValues, PartitionValues)])(implicit context: ActionPipelineContext): Unit
Move given partitions. This is used to archive partitions by housekeeping. Note: this is optional to implement.
- Definition Classes
- HadoopFileDataObject → CanHandlePartitions
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
-
val
options: Map[String, String]
Returns the configured options for the Spark DataFrameReader/DataFrameWriter.
- Definition Classes
- RawFileDataObject → SparkFileDataObject
- See also
DataFrameReader
DataFrameWriter
-
def
partitionLayout(): Option[String]
Return a String specifying the partition layout. For Hadoop the default partition layout is colname1=<value1>/colname2=<value2>/.../
- Definition Classes
- HasHadoopStandardFilestore
- Annotations
- @Scaladoc()
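The default Hadoop partition layout described above can be sketched in plain Scala; the helper name is illustrative, not the SDL implementation:

```scala
// Illustrative sketch of the default Hadoop partition layout
// colname1=<value1>/colname2=<value2>/.../ (not the SDL implementation).
def hadoopPartitionLayout(partitionValues: Seq[(String, String)]): String =
  partitionValues.map { case (col, value) => s"$col=$value" }.mkString("", "/", "/")

// hadoopPartitionLayout(Seq("year" -> "2021", "month" -> "07"))
// yields "year=2021/month=07/"
```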
-
val
partitions: Seq[String]
Definition of partition columns.
- Definition Classes
- RawFileDataObject → CanHandlePartitions
-
val
path: String
The root path of the files that are handled by this DataObject.
- Definition Classes
- RawFileDataObject → FileDataObject
-
def
postWrite(partitionValues: Seq[PartitionValues])(implicit context: ActionPipelineContext): Unit
Runs operations after writing to the DataObject.
- Definition Classes
- HadoopFileDataObject → DataObject
-
def
preWrite(implicit context: ActionPipelineContext): Unit
Runs operations before writing to the DataObject.
Note: As the transformed SubFeed doesn't yet exist in Action.preWrite, no partition values can be passed as parameters as in preRead.
- Definition Classes
- HadoopFileDataObject → DataObject
-
def
prepare(implicit context: ActionPipelineContext): Unit
Prepare & test the DataObject's prerequisites.
This runs during the "prepare" operation of the DAG.
- Definition Classes
- HadoopFileDataObject → FileDataObject → DataObject
-
def
relativizePath(path: String)(implicit context: ActionPipelineContext): String
Make a given path relative to this DataObject's base path.
- Definition Classes
- HadoopFileDataObject → FileDataObject
-
val
saveMode: SDLSaveMode
Overwrite or Append new data. When writing partitioned data, this applies only to the partitions concerned.
- Definition Classes
- RawFileDataObject → FileRefDataObject
-
val
schema: Option[StructType]
An optional DataObject user-defined schema definition.
Some DataObjects support optional schema inference. Specifying this attribute disables automatic schema inference. When the wrapped data source contains a source schema, this schema attribute is ignored.
Note: This is only used by the functionality defined in CanCreateDataFrame, that is, when reading Spark data frames from the underlying data container. io.smartdatalake.workflow.action.Actions that bypass Spark data frames ignore the schema attribute if it is defined.
- Definition Classes
- RawFileDataObject → UserDefinedSchema
-
val
schemaMin: Option[StructType]
An optional, minimal schema that a DataObject schema must have to pass schema validation.
The schema validation semantics are:
- Schema A is valid in respect to a minimal schema B when B is a subset of A. This means: the whole column set of B is contained in the column set of A.
- A column of B is contained in A when A contains a column with equal name and data type.
- Column order is ignored.
- Column nullability is ignored.
- Duplicate columns in terms of name and data type are eliminated (set semantics).
Note: This is mainly used by the functionality defined in CanCreateDataFrame and CanWriteDataFrame, that is, when reading or writing Spark data frames from/to the underlying data container. io.smartdatalake.workflow.action.Actions that work with files ignore the schemaMin attribute if it is defined. Additionally, schemaMin can be used to define the schema used if there is no data or the table doesn't yet exist.
- Definition Classes
- RawFileDataObject → SchemaValidation
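The validation semantics above can be sketched in plain Scala; Field is a stand-in for Spark's StructField, and the helper name is illustrative, not the SDL implementation:

```scala
// Stand-in for Spark's StructField: name plus data type, nullability ignored.
case class Field(name: String, dataType: String)

// Schema A satisfies minimal schema B when B is a subset of A
// (column order ignored, duplicates collapse via set semantics).
def satisfiesSchemaMin(schema: Set[Field], schemaMin: Set[Field]): Boolean =
  schemaMin.subsetOf(schema)

val actual  = Set(Field("id", "long"), Field("name", "string"), Field("ts", "timestamp"))
val minimal = Set(Field("id", "long"), Field("name", "string"))
// satisfiesSchemaMin(actual, minimal) yields true
// satisfiesSchemaMin(minimal, actual) yields false (ts is missing)
```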
-
val
separator: Char
Default separator for paths.
- Attributes
- protected
- Definition Classes
- FileDataObject
- Annotations
- @Scaladoc()
-
val
sparkRepartition: Option[SparkRepartitionDef]
Definition of repartition operation before writing DataFrame with Spark to Hadoop.
- Definition Classes
- RawFileDataObject → SparkFileDataObject
-
def
startWritingOutputStreams(partitionValues: Seq[PartitionValues] = Seq())(implicit context: ActionPipelineContext): Unit
This is called before any output stream is created to initialize writing. It is used to apply the SaveMode, e.g. deleting existing partitions.
- Definition Classes
- HadoopFileDataObject → CanCreateOutputStream
-
def
streamingOptions: Map[String, String]
- Definition Classes
- CanWriteDataFrame
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toStringShort: String
- Definition Classes
- DataObject
-
def
translateFileRefs(fileRefs: Seq[FileRef])(implicit context: ActionPipelineContext): Seq[FileRefMapping]
Given some FileRefs for another DataObject, translate the paths to the root path of this DataObject.
- Definition Classes
- FileRefDataObject
- Annotations
- @Scaladoc()
-
def
validateSchema(df: DataFrame, schemaExpected: StructType, role: String): Unit
Validate the schema of a given Spark Data Frame df against a given expected schema.
- df
The data frame to validate.
- schemaExpected
The expected schema to validate against.
- role
role used in exception message. Set to read or write.
- Definition Classes
- SchemaValidation
- Annotations
- @Scaladoc()
- Exceptions thrown
SchemaViolationException if the schemaMin does not validate.
-
def
validateSchemaHasPartitionCols(df: DataFrame, role: String): Unit
Validate that the schema of a given Spark Data Frame df contains the specified partition columns.
- df
The data frame to validate.
- role
role used in exception message. Set to read or write.
- Definition Classes
- CanHandlePartitions
- Annotations
- @Scaladoc()
- Exceptions thrown
SchemaViolationException if the partition columns are not included.
-
def
validateSchemaHasPrimaryKeyCols(df: DataFrame, primaryKeyCols: Seq[String], role: String): Unit
Validate that the schema of a given Spark Data Frame df contains the specified primary key columns.
- df
The data frame to validate.
- role
role used in exception message. Set to read or write.
- Definition Classes
- CanHandlePartitions
- Annotations
- @Scaladoc()
- Exceptions thrown
SchemaViolationException if the primary key columns are not included.
-
def
validateSchemaMin(df: DataFrame, role: String): Unit
Validate the schema of a given Spark Data Frame df against schemaMin.
- df
The data frame to validate.
- role
role used in exception message. Set to read or write.
- Definition Classes
- SchemaValidation
- Annotations
- @Scaladoc()
- Exceptions thrown
SchemaViolationExceptionis theschemaMindoes not validate.
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
writeDataFrame(df: DataFrame, partitionValues: Seq[PartitionValues] = Seq(), isRecursiveInput: Boolean = false, saveModeOptions: Option[SaveModeOptions] = None)(implicit context: ActionPipelineContext): Unit
Writes the provided DataFrame to the filesystem.
The partitionValues attribute is used to partition the output by the given columns on the file system.
- df
the DataFrame to write to the file system.
- partitionValues
The partition layout to write.
- isRecursiveInput
if the DataFrame needs this DataObject as input - special treatment might be needed in this case.
- Definition Classes
- SparkFileDataObject → CanWriteDataFrame
- Annotations
- @Scaladoc()
- See also
DataFrameWriter.partitionBy
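The partitionValues-based partitioning corresponds to Spark's DataFrameWriter.partitionBy, referenced above. A generic Spark sketch outside SDL (paths, column names and data are illustrative; requires a Spark runtime):

```scala
// Generic Spark sketch of partitioned file output via DataFrameWriter.partitionBy.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

val df = Seq(("2021", "07", "payload-a"), ("2021", "08", "payload-b"))
  .toDF("year", "month", "value")

df.write
  .partitionBy("year", "month") // creates year=<v>/month=<v>/ directories
  .format("text")               // partition columns are dropped from the data files
  .mode("overwrite")
  .save("/tmp/raw-out")
```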
-
def
writeStreamingDataFrame(df: DataFrame, trigger: Trigger, options: Map[String, String], checkpointLocation: String, queryName: String, outputMode: OutputMode = OutputMode.Append, saveModeOptions: Option[SaveModeOptions] = None)(implicit context: ActionPipelineContext): StreamingQuery
Write a Spark structured streaming DataFrame. The default implementation uses foreachBatch and this trait's writeDataFrame method to write the DataFrame. Some DataObjects will override this with specific implementations (Kafka).
- df
The Streaming DataFrame to write
- trigger
Trigger frequency for stream
- checkpointLocation
location for checkpoints of streaming query
- Definition Classes
- SparkFileDataObject → CanWriteDataFrame
Deprecated Value Members
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] ) @Deprecated
- Deprecated