Default source to provide new git relations.
Engine is the main entry point to all usage of the source{d} spark-engine. It has methods to configure every available option, as well as the methods to start analysing repositories of code.
```scala
import tech.sourced.engine._

val engine = Engine(sparkSession, "/path/to/repositories")
```
NOTE: if you choose to instantiate this class directly instead of using the companion object, you will need to register the UDFs in the session manually.
```scala
import tech.sourced.engine.{Engine, SessionFunctions}

val engine = new Engine(sparkSession)
sparkSession.registerUDFs()
```
The only method available as of now is getRepositories, which generates a DataFrame of repositories, the starting point for any analysis of repositories of code.
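For instance, assuming an engine built as shown above, obtaining and inspecting that initial repositories DataFrame could look like the following sketch (the repositories path is a placeholder):

```scala
import tech.sourced.engine._

// Build the engine and get the initial DataFrame of repositories.
val engine = Engine(sparkSession, "/path/to/repositories")
val repositories = engine.getRepositories

// It is a regular Spark DataFrame, so the usual API applies.
repositories.printSchema()
repositories.show(5)
```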
Adds utility methods to the org.apache.spark.sql.DataFrame class so that you can, for example, get the references, commits, etc. from a DataFrame containing repositories.
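As a sketch of how these implicit methods chain together, assuming getReferences and getCommits are among the methods this class adds (the selected column names are assumptions about the commits table schema, not confirmed here):

```scala
import tech.sourced.engine._

val engine = Engine(sparkSession, "/path/to/repositories")

// Each call returns a new DataFrame, so calls compose naturally.
val commits = engine.getRepositories
  .getReferences
  .getCommits

// Column names below are assumed from the commits table schema.
commits.select("hash", "message").show(10)
```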
A relation based on git data from rooted repositories in siva files. The data this relation offers depends on the given tableSource, which controls the table that will be accessed. Also, the tech.sourced.engine.rule.GitOptimizer might merge some table sources into one by squashing joins, in which case the result is the requested table chained with the previous one using chained iterators.
Parameters:
- Spark session
- schema of the relation
- join conditions, if any
- source table, if any
Data source to provide new metadata relations.
Implicit class that adds some functions to the org.apache.spark.sql.SparkSession.
Contains useful constants for the DefaultSource class.
Factory for tech.sourced.engine.Engine instances.
Defines the hierarchy between data sources.
Create a new Node from a binary-encoded node as a byte array.
Parameter: binary-encoded node as a byte array
Returns: the parsed Node
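As a sketch, assuming Node is the protobuf-generated UAST node from the Babelfish Scala SDK, and therefore exposes ScalaPB's parseFrom factory (both the import path and the parseFrom call are assumptions, not confirmed by this document):

```scala
// Assumption: Node is a ScalaPB-generated message from the Babelfish SDK,
// which provides a parseFrom(Array[Byte]) factory method.
import gopkg.in.bblfsh.sdk.v1.uast.generated.Node

// Decode a binary-encoded node into a parsed Node.
def decodeNode(encoded: Array[Byte]): Node =
  Node.parseFrom(encoded)
```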
Provides the tech.sourced.engine.Engine class, the main entry point of all the analysis you might do using this library, as well as some implicits that make it easier to use. In particular, it adds some methods to join with other "tables" directly from any org.apache.spark.sql.DataFrame.
If you don't want to import everything in the engine (even though it only exposes what's truly needed, so as not to pollute the user namespace), you can import just the tech.sourced.engine.Engine class and the tech.sourced.engine.EngineDataFrame implicit class.
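That selective import could look like the following sketch; it assumes both names are importable from the tech.sourced.engine package, as stated above:

```scala
// Import only the entry point and the DataFrame implicits,
// instead of the whole tech.sourced.engine._ namespace.
import tech.sourced.engine.Engine
import tech.sourced.engine.EngineDataFrame

val engine = Engine(sparkSession, "/path/to/repositories")

// getReferences resolves through the EngineDataFrame implicit in scope.
val references = engine.getRepositories.getReferences
```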