class FileStorageOpsWithStaging extends FileStorageOps with Logging
Implementation of FileStorageOps around FileSystem and SparkSession, with temporary and trash folders.
Inheritance
- FileStorageOpsWithStaging
- Logging
- FileStorageOps
- AnyRef
- Any
Instance Constructors
- new FileStorageOpsWithStaging(fs: FileSystem, sparkSession: SparkSession, tmpFolder: Path, trashBinFolder: Path)
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
atomicWriteAndCleanup(tableName: String, data: Dataset[_], newDataPath: Path, cleanUpPaths: Seq[Path], appendTimestamp: Timestamp): Unit
During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe; moving out data can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data and written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
- tableName
name of the table
- newDataPath
path into which the combined and repartitioned data from the dataset will be committed
- cleanUpPaths
list of sub-folders to remove once the writing and committing of the combined data is successful
- appendTimestamp
Timestamp of the compaction/append. Used to date the Trash folders.
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
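As a rough sketch of calling this overload during a compaction run (the `ops` instance, the prepared `compacted` Dataset, and the concrete paths are illustrative assumptions, not part of this API page):

```scala
import java.sql.Timestamp
import org.apache.hadoop.fs.Path

// Illustrative sketch: `ops` is assumed to be a configured FileStorageOpsWithStaging
// and `compacted` the merged, coalesced Dataset prepared beforehand.
val newDataPath  = new Path("/data/db/tbl1/type=cold/region=15")
val cleanUpPaths = Seq(11, 12, 13, 14).map(r => new Path(s"/data/db/tbl1/type=hot/region=$r"))
val ts           = new Timestamp(System.currentTimeMillis())

// The old region folders are moved into the table's trash folder, dated with `ts`,
// only after the new folder has been fully written and committed.
ops.atomicWriteAndCleanup("tbl1", compacted, newDataPath, cleanUpPaths, ts)
```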
-
final
def
atomicWriteAndCleanup(tableName: String, compactedData: Dataset[_], newDataPath: Path, cleanUpBase: Path, cleanUpFolders: Seq[String], appendTimestamp: Timestamp): Unit
During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe; moving out data can only take place after the new version is fully written and committed.
E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data and written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.
Starting state:
/data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14
Final state:
/data/db/tbl1/type=cold/region=15 /data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14
- tableName
name of the table
- compactedData
the dataset with data from fromSubFolders already repartitioned; it will be saved into newDataPath
- newDataPath
path into which the combined and repartitioned data from the dataset will be committed
- cleanUpBase
parent folder from which to remove the cleanUpFolders
- cleanUpFolders
list of sub-folders to remove once the writing and committing of the combined data is successful
- appendTimestamp
Timestamp of the compaction/append. Used to date the Trash folders.
- Definition Classes
- FileStorageOps
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
deletePath(path: Path, recursive: Boolean): Unit
Delete a given path.
- path
File or directory to delete
- recursive
Recurse into directories
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
globTablePaths[A](basePath: Path, tableNames: Seq[String], tablePartitions: Seq[String], parFun: PartialFunction[FileStatus, A])(implicit arg0: ClassTag[A]): Seq[A]
Glob a list of table paths with partitions, and apply a partial function to collect (filter + map) the result, transforming each FileStatus to any type A.
- A
return type of the final sequence
- basePath
parent folder which contains folders with table names
- tableNames
list of table names to search under
- tablePartitions
list of partition columns to include in the path
- parFun
a partial function to transform a FileStatus to the type A
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
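A minimal sketch of how the collect semantics of parFun might be used (the `ops` instance, the base path, and the glob-style tablePartitions entries are illustrative assumptions):

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Illustrative sketch: `ops` is an assumed FileStorageOpsWithStaging instance.
// The partial function acts like collect: statuses it is not defined for are
// filtered out, and the remaining ones are mapped to their Path.
val filePaths: Seq[Path] = ops.globTablePaths(
  new Path("/data/db"),
  Seq("tbl1", "tbl2"),
  Seq("region=*"),
  { case status: FileStatus if status.isFile => status.getPath }
)
```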
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
listTables(basePath: Path): Seq[String]
Lists tables in the basePath. It will ignore any folder/table that starts with '.'.
- basePath
parent folder which contains folders with table names
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
def
logAndReturn[A](a: A, msg: String, level: Level): A
- Definition Classes
- Logging
-
def
logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
mkdirs(path: Path): Boolean
Creates folders on the physical storage.
- path
path to create
- returns
true if the folder exists or was created without problems, false if there were problems creating all folders in the path
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
openParquet(path: Path, paths: Path*): Option[Dataset[_]]
Opens a parquet file from the path, which can be a folder or a file. If there are partitioned sub-folders with files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.
- path
path to open
- returns
Some with dataset if there is data, None if path does not exist or can not be opened
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
- Exceptions thrown
Exception in case of connectivity problems
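A short sketch of the Option-based contract (the `ops` instance and the path are illustrative assumptions):

```scala
import org.apache.hadoop.fs.Path

// Illustrative sketch: `ops` is an assumed FileStorageOpsWithStaging instance.
// None signals a missing or unreadable path, so the caller can fall back
// without catching exceptions; connectivity problems still surface as exceptions.
ops.openParquet(new Path("/data/db/tbl1/type=cold/region=15")) match {
  case Some(ds) => println(s"read ${ds.count()} rows")
  case None     => println("path does not exist or could not be opened")
}
```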
-
def
pathExists(path: Path): Boolean
Checks if the path exists in the physical storage.
- returns
true if path exists in the storage layer
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
def
purgeTrash(tableName: String, appendTimestamp: Timestamp, trashMaxAge: Duration): Unit
Purge the trash folder for a given table. All trashed region folders that were placed into the trash and are older than the given maximum age will be deleted.
- tableName
Name of the table to purge the trash for
- appendTimestamp
Timestamp of the current compaction/append. All ages will be compared relative to this timestamp
- trashMaxAge
Maximum age of trashed regions to keep relative to the above timestamp
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
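A hedged usage sketch (the `ops` instance is an illustrative assumption, and trashMaxAge is assumed here to be a java.time.Duration):

```scala
import java.sql.Timestamp
import java.time.Duration

// Illustrative sketch: `ops` is an assumed FileStorageOpsWithStaging instance.
// Drop tbl1's trashed region folders older than 7 days, with ages measured
// relative to the timestamp of the current compaction/append.
val now = new Timestamp(System.currentTimeMillis())
ops.purgeTrash("tbl1", now, Duration.ofDays(7))
```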
-
def
readAuditTableInfo(basePath: Path, tableName: String): Try[AuditTableInfo]
Reads the table info back.
- basePath
parent folder which contains folders with table names
- tableName
name of the table to read for
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
val
sparkSession: SparkSession
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
writeAuditTableInfo(basePath: Path, auditTableInfo: AuditTableInfo): Try[AuditTableInfo]
Writes out static data about the audit table into the basePath/table_name/.table_info file.
- basePath
parent folder which contains folders with table names
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
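A round-trip sketch pairing this with readAuditTableInfo (the `ops` instance and the populated `info: AuditTableInfo` are illustrative assumptions):

```scala
import org.apache.hadoop.fs.Path
import scala.util.{Failure, Success}

// Illustrative sketch: `ops` and `info: AuditTableInfo` are assumed.
ops.writeAuditTableInfo(new Path("/data/db"), info) match {
  case Success(_) => println("metadata written to <basePath>/<table_name>/.table_info")
  case Failure(e) => println(s"write failed: ${e.getMessage}")
}

// Reading it back also returns a Try, so missing or corrupt metadata is a Failure.
val roundTrip = ops.readAuditTableInfo(new Path("/data/db"), "tbl1")
```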
-
def
writeParquet(tableName: String, path: Path, ds: Dataset[_], overwrite: Boolean = true, tempSubfolder: Option[String] = None): Unit
Commits the dataset into the full path. The path is the full path into which the parquet will be placed after it is fully written into the temp folder.
- tableName
name of the table; only used for writing into the temp folder
- path
full destination path
- ds
dataset to write out; no partitioning will be performed on it
- overwrite
whether to overwrite the existing data in path. If false, folder contents will be merged
- tempSubfolder
an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubFolder. If not given, then the path becomes $temp/$tableName/${path.getName}
- Definition Classes
- FileStorageOpsWithStaging → FileStorageOps
- Exceptions thrown
Exception can be thrown due to access permissions, connectivity, Spark UDFs (as datasets are lazily executed)
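A minimal staged-write sketch (the `ops` instance, the Dataset `ds`, and the destination path are illustrative assumptions):

```scala
import org.apache.hadoop.fs.Path

// Illustrative sketch: `ops` and the Dataset `ds` are assumed. With no
// tempSubfolder given, data is staged under $temp/tbl1/region=15 and only
// then committed to the destination path.
ops.writeParquet(
  tableName = "tbl1",
  path      = new Path("/data/db/tbl1/type=cold/region=15"),
  ds        = ds,
  overwrite = true
)
```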