
com.coxautodata.waimak.storage

FileStorageOpsWithStaging

class FileStorageOpsWithStaging extends FileStorageOps with Logging

An implementation of FileStorageOps around a FileSystem and SparkSession, using temporary and trash folders.

Linear Supertypes
Logging, FileStorageOps, AnyRef, Any

Instance Constructors

  1. new FileStorageOpsWithStaging(fs: FileSystem, sparkSession: SparkSession, tmpFolder: Path, trashBinFolder: Path)
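A minimal construction sketch, assuming a running Spark session; the concrete folder paths here are illustrative assumptions, not prescribed by this API:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Construction sketch; the paths below are assumptions for illustration.
val spark = SparkSession.builder().appName("storage-ops-example").getOrCreate()
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val ops = new FileStorageOpsWithStaging(
  fs,
  spark,
  new Path("/data/db/.tmp"),   // staging folder for in-flight writes
  new Path("/data/db/.Trash")  // trash folder for replaced regions
)
```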

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def atomicWriteAndCleanup(tableName: String, data: Dataset[_], newDataPath: Path, cleanUpPaths: Seq[Path], appendTimestamp: Timestamp): Unit

During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving out old data can only take place after the new version is fully written and committed.

E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data, then written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

    /data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14

    Final state:

/data/db/tbl1/type=cold/region=15
/data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14

    tableName

    name of the table

    newDataPath

path into which the combined and repartitioned data from the dataset will be committed

    cleanUpPaths

    list of sub-folders to remove once the writing and committing of the combined data is successful

    appendTimestamp

    Timestamp of the compaction/append. Used to date the Trash folders.

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
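The region-merge scenario above might be invoked roughly as follows, given some ops: FileStorageOpsWithStaging and spark: SparkSession in scope (construction not shown); the coalesce target and timestamp are assumptions:

```scala
import java.sql.Timestamp
import org.apache.hadoop.fs.Path

// Merge four hot regions into one cold region, mirroring the example above.
val hotBase = new Path("/data/db/tbl1/type=hot")
val toClean = Seq("region=11", "region=12", "region=13", "region=14")
  .map(r => new Path(hotBase, r))

// Read all hot regions and repartition; 4 is an assumed target partition count.
val merged = spark.read.parquet(toClean.map(_.toString): _*).coalesce(4)

ops.atomicWriteAndCleanup(
  tableName = "tbl1",
  data = merged,
  newDataPath = new Path("/data/db/tbl1/type=cold/region=15"),
  cleanUpPaths = toClean,
  appendTimestamp = new Timestamp(System.currentTimeMillis())
)
```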
  6. final def atomicWriteAndCleanup(tableName: String, compactedData: Dataset[_], newDataPath: Path, cleanUpBase: Path, cleanUpFolders: Seq[String], appendTimestamp: Timestamp): Unit

During compaction, data from multiple folders needs to be merged and re-written into one folder with fewer files. The operation has to be fail-safe: moving out old data can only take place after the new version is fully written and committed.

E.g. data from fromBase=/data/db/tbl1/type=hot and fromSubFolders=Seq("region=11", "region=12", "region=13", "region=14") will be merged and coalesced into an optimal number of partitions in the Dataset data, then written out into newDataPath=/data/db/tbl1/type=cold/region=15, with the old folders being moved into the table's trash folder.

    Starting state:

    /data/db/tbl1/type=hot/region=11 .../region=12 .../region=13 .../region=14

    Final state:

/data/db/tbl1/type=cold/region=15
/data/db/.Trash/tbl1/${appendTimestamp}/region=11 .../region=12 .../region=13 .../region=14

    tableName

    name of the table

    compactedData

the dataset with data from fromSubFolders already repartitioned; it will be saved into newDataPath

    newDataPath

path into which the combined and repartitioned data from the dataset will be committed

    cleanUpBase

    parent folder from which to remove the cleanUpFolders

    cleanUpFolders

    list of sub-folders to remove once the writing and committing of the combined data is successful

    appendTimestamp

    Timestamp of the compaction/append. Used to date the Trash folders.

    Definition Classes
    FileStorageOps
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  8. def deletePath(path: Path, recursive: Boolean): Unit

Delete a given path.

    path

    File or directory to delete

    recursive

    Recurse into directories

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
  9. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  11. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  12. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. def globTablePaths[A](basePath: Path, tableNames: Seq[String], tablePartitions: Seq[String], parFun: PartialFunction[FileStatus, A])(implicit arg0: ClassTag[A]): Seq[A]

Glob a list of table paths with partitions, and apply a partial function to collect (filter+map) the results, transforming each FileStatus into any type A.

    A

    return type of final sequence

    basePath

    parent folder which contains folders with table names

    tableNames

    list of table names to search under

    tablePartitions

    list of partition columns to include in the path

    parFun

a partial function to transform a FileStatus into type A

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
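A sketch of collecting parquet file paths across several tables with a partial function, given some ops: FileStorageOpsWithStaging in scope; the table and partition glob names are illustrative assumptions:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Collect the paths of parquet files under the tables' partition folders.
// The partial function both filters (only .parquet files match) and maps
// each matching FileStatus to its Path.
val parquetPaths: Seq[Path] = ops.globTablePaths(
  basePath = new Path("/data/db"),
  tableNames = Seq("tbl1", "tbl2"),
  tablePartitions = Seq("type=*", "region=*"),
  parFun = { case s: FileStatus if s.getPath.getName.endsWith(".parquet") => s.getPath }
)
```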
  14. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  15. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  16. def isTraceEnabled(): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  17. def listTables(basePath: Path): Seq[String]

Lists tables in the basePath. It will ignore any folder/table that starts with '.'.

    basePath

    parent folder which contains folders with table names

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
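For instance, enumerating tables and reading each table's info could be sketched as follows, given some ops: FileStorageOpsWithStaging in scope; the base path is an assumption:

```scala
import org.apache.hadoop.fs.Path

val base = new Path("/data/db")
// listTables skips dot-prefixed folders, so e.g. .Trash and .tmp are excluded.
val tables: Seq[String] = ops.listTables(base)

// readAuditTableInfo returns a Try, so failures are skipped here rather than thrown.
tables.foreach { t =>
  ops.readAuditTableInfo(base, t).foreach(info => println(s"$t -> $info"))
}
```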
  18. def logAndReturn[A](a: A, msg: String, level: Level): A
    Definition Classes
    Logging
  19. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A
    Definition Classes
    Logging
  20. def logDebug(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  21. def logDebug(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  22. def logError(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  23. def logError(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  24. def logInfo(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  25. def logInfo(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  26. def logName: String
    Attributes
    protected
    Definition Classes
    Logging
  27. def logTrace(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  28. def logTrace(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  29. def logWarning(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  30. def logWarning(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  31. def mkdirs(path: Path): Boolean

Creates folders on the physical storage.

    path

    path to create

    returns

    true if the folder exists or was created without problems, false if there were problems creating all folders in the path

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
  32. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  33. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  34. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  35. def openParquet(path: Path, paths: Path*): Option[Dataset[_]]

Opens a parquet file from the path, which can be a folder or a file. If there are partitioned sub-folders with files having slightly different schemas, it will attempt to merge the schemas to accommodate schema evolution.

    path

    path to open

    returns

Some with the dataset if there is data, None if the path does not exist or cannot be opened

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
    Exceptions thrown

Exception in case of connectivity problems

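Because openParquet returns an Option rather than throwing on a missing path, a read can be sketched as follows, given some ops: FileStorageOpsWithStaging in scope; the path is an assumption:

```scala
import org.apache.hadoop.fs.Path

// None means the path is absent, so absence is handled without exceptions.
val maybeData = ops.openParquet(new Path("/data/db/tbl1/type=cold/region=15"))
maybeData match {
  case Some(ds) => println(s"rows: ${ds.count()}")
  case None     => println("no data at path")
}
```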
  36. def pathExists(path: Path): Boolean

Checks if the path exists in the physical storage.

    returns

    true if path exists in the storage layer

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
  37. def purgeTrash(tableName: String, appendTimestamp: Timestamp, trashMaxAge: Duration): Unit

Purge the trash folder for a given table. All trashed region folders older than the given maximum age will be deleted.

    tableName

    Name of the table to purge the trash for

    appendTimestamp

    Timestamp of the current compaction/append. All ages will be compared relative to this timestamp

    trashMaxAge

    Maximum age of trashed regions to keep relative to the above timestamp

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
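A sketch of purging trash older than, say, 7 days relative to the current append, given some ops: FileStorageOpsWithStaging in scope. The 7-day policy is an assumption, as is the use of scala.concurrent.duration.Duration for the trashMaxAge parameter:

```scala
import java.sql.Timestamp
import scala.concurrent.duration._

val now = new Timestamp(System.currentTimeMillis())
// Assumed retention policy: trashed regions older than 7 days
// (relative to this append timestamp) are deleted.
ops.purgeTrash("tbl1", now, 7.days)
```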
  38. def readAuditTableInfo(basePath: Path, tableName: String): Try[AuditTableInfo]

Reads the table info back.

    basePath

    parent folder which contains folders with table names

    tableName

    name of the table to read for

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
  39. val sparkSession: SparkSession
  40. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  41. def toString(): String
    Definition Classes
    AnyRef → Any
  42. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  43. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  44. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  45. def writeAuditTableInfo(basePath: Path, auditTableInfo: AuditTableInfo): Try[AuditTableInfo]

Writes out static data about the audit table into the basePath/table_name/.table_info file.

    basePath

    parent folder which contains folders with table names

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
  46. def writeParquet(tableName: String, path: Path, ds: Dataset[_], overwrite: Boolean = true, tempSubfolder: Option[String] = None): Unit

Commits the dataset into the full path. The path is the full destination into which the parquet will be placed after it is fully written into the temp folder.

    tableName

name of the table; it will only be used to write into the temp folder

    path

    full destination path

    ds

dataset to write out. No partitioning will be performed on it

    overwrite

whether to overwrite the existing data in path. If false, the folder contents will be merged

    tempSubfolder

    an optional subfolder used for writing temporary data, used like $temp/$tableName/$tempSubFolder. If not given, then path becomes: $temp/$tableName/${path.getName}

    Definition Classes
FileStorageOpsWithStaging → FileStorageOps
    Exceptions thrown

Exceptions can be thrown due to access permissions, connectivity, or Spark UDFs (as datasets are lazily executed)
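The write-to-temp-then-commit behaviour described above might be exercised as follows, given some ops: FileStorageOpsWithStaging in scope; the dataset and destination path are assumptions:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val ds = spark.range(100).toDF("id") // illustrative dataset

// The parquet is first written under the staging folder
// ($temp/$tableName/${path.getName} here, since no tempSubfolder is given),
// then moved to `path`; overwrite = false merges with existing folder contents.
ops.writeParquet(
  tableName = "tbl1",
  path = new Path("/data/db/tbl1/type=hot/region=16"),
  ds = ds,
  overwrite = false
)
```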
