Class

com.coxautodata.waimak.storage

FileStorageOpsWithStaging

class FileStorageOpsWithStaging extends FileStorageOps with Logging

Implementation of FileStorageOps on top of a FileSystem and a SparkSession, using a temporary folder and a trash folder.

Linear Supertypes
Logging, FileStorageOps, AnyRef, Any

Instance Constructors

  1. new FileStorageOpsWithStaging(fs: FileSystem, sparkSession: SparkSession, tmpFolder: Path, trashBinFolder: Path)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def atomicWriteAndCleanup(tableName: String, data: Dataset[_], newDataPath: Path, cleanUpPaths: Seq[Path], appendTimestamp: Timestamp): Unit

    During compaction, data from multiple folders needs to be merged and rewritten into one folder with fewer files. The operation has to be fail-safe: the old data may only be moved out after the new version has been fully written and committed.

    E.g. data from the folders region=11, region=12, region=13 and region=14 under /data/db/tbl1/type=hot (passed in cleanUpPaths) will be merged and coalesced into an optimal number of partitions in the Dataset data and written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

    /data/db/tbl1/type=hot/region=11
    .../region=12
    .../region=13
    .../region=14

    Final state:

    /data/db/tbl1/type=cold/region=15
    /data/db/.Trash/tbl1/${appendTimestamp}/region=11
    .../region=12
    .../region=13
    .../region=14

    tableName

    name of the table

    newDataPath

    path into which the combined and repartitioned data from the dataset will be committed

    cleanUpPaths

    list of sub-folders to remove once the writing and committing of the combined data is successful

    appendTimestamp

    Timestamp of the compaction/append. Used to date the Trash folders.

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
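
The ordering described above (write, commit, and only then move the old folders to the trash) can be sketched on a local file system. This is a minimal illustration using java.nio.file, not the Waimak implementation: the object name, the `writeData` callback and the use of a plain String for the timestamp label are assumptions made for the example.

```scala
import java.nio.file.{Files, Path}

object CompactionSketch {

  /** Hypothetical local-file-system stand-in for the compaction commit.
    * `writeData` is an assumed callback that writes the merged data into
    * the folder it is given. Assumes newDataPath has a parent folder. */
  def atomicWriteAndCleanup(tmpFolder: Path,
                            trashFolder: Path,
                            tableName: String,
                            newDataPath: Path,
                            cleanUpPaths: Seq[Path],
                            appendTimestamp: String)(writeData: Path => Unit): Unit = {
    // 1. Write the new version into a staging folder first.
    val staging = tmpFolder.resolve(tableName).resolve(newDataPath.getFileName.toString)
    Files.createDirectories(staging)
    writeData(staging)

    // 2. Commit: move the fully written folder to its final location
    //    (a rename when both folders are on the same file system).
    Files.createDirectories(newDataPath.getParent)
    Files.move(staging, newDataPath)

    // 3. Only after the commit, move the old folders into the table's trash.
    val trash = trashFolder.resolve(tableName).resolve(appendTimestamp)
    Files.createDirectories(trash)
    cleanUpPaths.foreach(p => Files.move(p, trash.resolve(p.getFileName.toString)))
  }
}
```

If step 1 or 2 fails, the folders in cleanUpPaths are untouched, which is the fail-safe property the description asks for.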
  6. final def atomicWriteAndCleanup(tableName: String, compactedData: Dataset[_], newDataPath: Path, cleanUpBase: Path, cleanUpFolders: Seq[String], appendTimestamp: Timestamp): Unit

    During compaction, data from multiple folders needs to be merged and rewritten into one folder with fewer files. The operation has to be fail-safe: the old data may only be moved out after the new version has been fully written and committed.

    E.g. data from cleanUpBase=/data/db/tbl1/type=hot and cleanUpFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset compactedData and written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

    /data/db/tbl1/type=hot/region=11
    .../region=12
    .../region=13
    .../region=14

    Final state:

    /data/db/tbl1/type=cold/region=15
    /data/db/.Trash/tbl1/${appendTimestamp}/region=11
    .../region=12
    .../region=13
    .../region=14

    tableName

    name of the table

    compactedData

    the data set holding the already repartitioned data from cleanUpFolders; it will be saved into newDataPath

    newDataPath

    path into which the combined and repartitioned data from the dataset will be committed

    cleanUpBase

    parent folder from which to remove the cleanUpFolders

    cleanUpFolders

    list of sub-folders to remove once the writing and committing of the combined data is successful

    appendTimestamp

    Timestamp of the compaction/append. Used to date the Trash folders.

    Definition Classes
    FileStorageOps
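
Judging by the two signatures, this overload differs from the previous one only in how the folders to clean up are named: a base folder plus sub-folder names instead of full paths. A minimal sketch of that conversion follows. It assumes (this is not confirmed by the documentation) that the names are simply resolved against cleanUpBase, and it uses java.nio.file paths and a hypothetical helper name for illustration.

```scala
import java.nio.file.{Path, Paths}

object CleanUpPathsSketch {

  /** Hypothetical helper: resolves each sub-folder name against the base folder. */
  def toCleanUpPaths(cleanUpBase: Path, cleanUpFolders: Seq[String]): Seq[Path] =
    cleanUpFolders.map(folder => cleanUpBase.resolve(folder))
}
```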
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def deletePath(path: Path, recursive: Boolean): Unit

    Delete a given path

    path

    File or directory to delete

    recursive

    Recurse into directories

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
  9. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  11. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  12. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  13. def globTablePaths[A](basePath: Path, tableNames: Seq[String], tablePartitions: Seq[String], parFun: PartialFunction[FileStatus, A]): Seq[A]

    Globs a list of table paths with partitions and applies a partial function to the result, collecting (filter + map) each matching FileStatus into a value of type A.

    A

    return type of final sequence

    basePath

    parent folder which contains folders with table names

    tableNames

    list of table names to search under

    tablePartitions

    list of partition columns to include in the path

    parFun

    a partial function to transform FileStatus to any type A

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
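
The "collect (filter+map)" idea can be shown in isolation: a partial function both selects the entries of interest and transforms them. The sketch below is an illustration under simplifying assumptions, not the library's code. It uses java.io.File instead of FileStatus, lists only the direct children of each table folder rather than globbing partition columns, and the object and method names are hypothetical.

```scala
import java.io.File

object GlobSketch {

  /** Lists the children of every `basePath/tableName` folder and keeps only
    * the entries on which `parFun` is defined, transformed by it.
    * Table folders that do not exist contribute nothing. */
  def collectTableEntries[A](basePath: File, tableNames: Seq[String])(
      parFun: PartialFunction[File, A]): Seq[A] =
    tableNames
      .flatMap(t => Option(new File(basePath, t).listFiles()).map(_.toList).getOrElse(Nil))
      .sortBy(_.getPath) // listing order is OS dependent, so sort for stable output
      .collect(parFun)
}
```

A caller would pass something like `{ case f if f.isDirectory => f.getName }` to keep only folders and return their names.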
  14. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  15. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  16. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  17. def listTables(basePath: Path): Seq[String]

    Lists tables in the basePath. It will ignore any folder/table that starts with '.'

    basePath

    parent folder which contains folders with table names

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
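
The dot-prefix rule keeps service folders such as a trash folder out of the table list. A minimal local-file-system sketch of that rule, using java.io.File rather than the FileSystem the class works with (the object name is hypothetical):

```scala
import java.io.File

object ListTablesSketch {

  /** Returns the names of the folders directly under basePath,
    * skipping any whose name starts with '.'. */
  def listTables(basePath: File): Seq[String] =
    Option(basePath.listFiles()).map(_.toList).getOrElse(Nil)
      .filter(_.isDirectory)
      .map(_.getName)
      .filterNot(_.startsWith("."))
      .sorted
}
```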
  18. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  19. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  20. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  21. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  25. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def mkdirs(path: Path): Boolean

    Creates folders on the physical storage.

    path

    path to create

    returns

    true if the folder exists or was created without problems, false if there were problems creating all folders in the path

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
  30. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  31. final def notify(): Unit

    Definition Classes
    AnyRef
  32. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  33. def openParquet(path: Path, paths: Path*): Option[Dataset[_]]

    Opens parquet data from the path, which can be a folder or a file. If there are partitioned sub-folders containing files with slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.

    path

    path to open

    returns

    Some with the dataset if there is data, None if the path does not exist or cannot be opened

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
    Exceptions thrown

    Exception in case of connectivity problems

  34. def pathExists(path: Path): Boolean

    Checks if the path exists in the physical storage.

    returns

    true if path exists in the storage layer

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
  35. def purgeTrash(tableName: String, appendTimestamp: Timestamp, trashMaxAge: Duration): Unit

    Purges the trash folder for a given table. All trashed region folders that were moved into the trash longer ago than the given maximum age will be deleted.

    tableName

    Name of the table to purge the trash for

    appendTimestamp

    Timestamp of the current compaction/append. All ages will be compared relative to this timestamp

    trashMaxAge

    Maximum age of trashed regions to keep relative to the above timestamp

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
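
The age check is relative to appendTimestamp rather than to the wall clock, which makes a purge reproducible for a given append. The sketch below illustrates that comparison on a local file system. It is not the library's code: it assumes, hypothetically, that each trash sub-folder is named after the epoch-millisecond timestamp at which it was trashed, and it uses java.time.Instant and java.time.Duration in place of the Timestamp and Duration types of the real signature.

```scala
import java.io.File
import java.time.{Duration, Instant}
import scala.util.Try

object PurgeTrashSketch {

  /** Deletes a file or a folder together with everything below it. */
  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  /** Deletes every sub-folder of tableTrash whose (assumed) epoch-millisecond
    * name is older than appendTimestamp minus trashMaxAge.
    * Entries with non-numeric names are left alone. */
  def purgeTrash(tableTrash: File, appendTimestamp: Instant, trashMaxAge: Duration): Unit = {
    val cutoff = appendTimestamp.minus(trashMaxAge).toEpochMilli
    Option(tableTrash.listFiles()).map(_.toList).getOrElse(Nil)
      .flatMap(f => Try(f.getName.toLong).toOption.map(ts => (f, ts)))
      .collect { case (f, ts) if f.isDirectory && ts < cutoff => f }
      .foreach(deleteRecursively)
  }
}
```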
  36. def readAuditTableInfo(basePath: Path, tableName: String): Try[AuditTableInfo]

    Reads the table info back.

    basePath

    parent folder which contains folders with table names

    tableName

    name of the table whose info is read

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
  37. val sparkSession: SparkSession

  38. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  39. def toString(): String

    Definition Classes
    AnyRef → Any
  40. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  42. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  43. def writeAuditTableInfo(basePath: Path, auditTableInfo: AuditTableInfo): Try[AuditTableInfo]

    Writes out static data about the audit table into basePath/table_name/.table_info file.

    basePath

    parent folder which contains folders with table names

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
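
Both writeAuditTableInfo and readAuditTableInfo return a Try, so a failure is a value the caller inspects rather than an exception to catch. The round trip through basePath/table_name/.table_info can be sketched as below. Everything except that file location is an assumption: `Info` is a hypothetical, much-reduced stand-in for AuditTableInfo (whose fields are not listed on this page), and the comma-separated format is chosen only to keep the example short.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import scala.util.Try

object TableInfoSketch {

  // Hypothetical stand-in for AuditTableInfo; the real fields are not shown here.
  final case class Info(tableName: String, keys: Seq[String])

  private def infoFile(basePath: Path, tableName: String): Path =
    basePath.resolve(tableName).resolve(".table_info")

  /** Writes the info to basePath/tableName/.table_info, returning it on success. */
  def write(basePath: Path, info: Info): Try[Info] = Try {
    val file = infoFile(basePath, info.tableName)
    Files.createDirectories(file.getParent)
    Files.write(file, info.keys.mkString(",").getBytes(UTF_8))
    info
  }

  /** Reads the info back; a missing file yields a Failure, not an exception. */
  def read(basePath: Path, tableName: String): Try[Info] = Try {
    val text = new String(Files.readAllBytes(infoFile(basePath, tableName)), UTF_8)
    Info(tableName, text.split(",").toList.filter(_.nonEmpty))
  }
}
```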
  44. def writeParquet(tableName: String, path: Path, ds: Dataset[_], overwrite: Boolean = true, tempSubfolder: Option[String] = None): Unit

    Commits the data set into the full path. The path is the final location into which the parquet data will be placed after it has been fully written into the temp folder.

    tableName

    name of the table, will only be used to write into tmp

    path

    full destination path

    ds

    dataset to write out. No partitioning will be performed on it

    overwrite

    whether to overwrite the existing data in path. If false, the folder contents will be merged

    tempSubfolder

    an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubfolder. If not given, the temporary path becomes $temp/$tableName/${path.getName}

    Definition Classes
    FileStorageOpsWithStaging → FileStorageOps
    Exceptions thrown

    Exception can be thrown due to access permissions, connectivity problems or failing Spark UDFs (datasets are executed lazily, so such failures only surface when the data is written)
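
The rule for choosing the temporary location, as stated for tempSubfolder, can be written out as a small function. This is a sketch of the documented rule only, using java.nio.file paths instead of the Path type of the real signature; the object and function names are hypothetical.

```scala
import java.nio.file.{Path, Paths}

object TempPathSketch {

  /** $temp/$tableName/$tempSubfolder when a subfolder is given,
    * otherwise $temp/$tableName/${path.getName}. */
  def tempPathFor(tmpFolder: Path,
                  tableName: String,
                  path: Path,
                  tempSubfolder: Option[String] = None): Path =
    tmpFolder
      .resolve(tableName)
      .resolve(tempSubfolder.getOrElse(path.getFileName.toString))
}
```

Passing an explicit tempSubfolder is useful when two writes for the same table target destinations with the same last path segment, since their default temporary folders would otherwise coincide.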
