Class

tech.sourced.engine

EngineDataFrame

Related Doc: package engine

implicit class EngineDataFrame extends AnyRef

Adds some utility methods to the org.apache.spark.sql.DataFrame class so you can, for example, get the references, commits, etc from a data frame containing repositories.
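
As a rough illustration of the pattern, the sketch below shows how an implicit class can add methods to an existing type. It does not use Spark or the engine: `Table`, `TableSyntax`, `RichTable` and `whereEquals` are hypothetical names invented for this example, with `Table` standing in for a DataFrame.

```scala
// Hypothetical stand-in for a DataFrame: a list of rows keyed by column name.
// This is NOT the engine's implementation, only an illustration of the pattern.
final case class Table(rows: Seq[Map[String, String]])

object TableSyntax {
  // An implicit class wraps a value and exposes extra methods on it.
  implicit class RichTable(val table: Table) extends AnyVal {
    // Keep only the rows whose given column equals the given value.
    def whereEquals(column: String, value: String): Table =
      Table(table.rows.filter(_.get(column).contains(value)))
  }
}

object ImplicitClassDemo {
  import TableSyntax._

  val refs: Table = Table(Seq(
    Map("repository_id" -> "repo-a", "name" -> "refs/heads/master"),
    Map("repository_id" -> "repo-a", "name" -> "refs/heads/develop")
  ))

  // Table has no whereEquals method; the compiler resolves this call
  // through the implicit RichTable wrapper brought in by the import.
  val develop: Table = refs.whereEquals("name", "refs/heads/develop")

  def main(args: Array[String]): Unit =
    println(develop.rows.size)
}
```

In the same way, importing the engine package is what makes methods such as getReferences or getCommits callable directly on a DataFrame.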

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new EngineDataFrame(df: DataFrame)

    df

    the DataFrame

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def classifyLanguages: DataFrame

    Returns a new org.apache.spark.sql.DataFrame with an added "lang" column containing the language of each non-binary file. The current DataFrame must already contain the files data.

    val languagesDf = filesDf.classifyLanguages
    returns

    new DataFrame that also contains language data.

  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  9. def extractTokens(queryColumn: String = "result", outputColumn: String = "tokens"): DataFrame

    Extracts the tokens from all the nodes in a given column and puts the list of retrieved tokens in a new column.

    val tokensDf = uastDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]")
        .extractTokens()
    queryColumn

    column that contains the UAST nodes

    outputColumn

    column in which to put the result

    returns

    new DataFrame with the extracted tokens
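
    The idea behind combining queryUAST and extractTokens can be sketched without Spark or bblfsh. The model below is a toy under stated assumptions: `ToyNode`, `allNodes`, `identifiers` and `tokens` are hypothetical names, and the role filter only approximates what the XPath query in the example above selects. It is not how the engine is implemented.

```scala
object TokenExtractionSketch {
  // Toy stand-in for a UAST node; real nodes are Protobuf-serialized bblfsh nodes.
  final case class ToyNode(token: String, roles: Set[String], children: Seq[ToyNode] = Nil)

  // Depth-first list of a node followed by all of its descendants.
  def allNodes(root: ToyNode): Seq[ToyNode] =
    root +: root.children.flatMap(allNodes)

  // Rough analogue of //*[@roleIdentifier and not(@roleIncomplete)]:
  // nodes that have the Identifier role and lack the Incomplete role.
  def identifiers(root: ToyNode): Seq[ToyNode] =
    allNodes(root).filter(n => n.roles("Identifier") && !n.roles("Incomplete"))

  // Rough analogue of extractTokens: collect the token of each selected node.
  def tokens(nodes: Seq[ToyNode]): Seq[String] =
    nodes.map(_.token)

  val tree: ToyNode = ToyNode("", Set("File"), Seq(
    ToyNode("foo", Set("Identifier")),
    ToyNode("bar", Set("Identifier", "Incomplete")),
    ToyNode("", Set("Block"), Seq(ToyNode("baz", Set("Identifier"))))
  ))

  def main(args: Array[String]): Unit =
    println(tokens(identifiers(tree)).mkString(", ")) // prints "foo, baz"
}
```

    The query step narrows the nodes and the extraction step reduces each remaining node to its token, which is why the two calls are usually chained.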

  10. def extractUASTs(): DataFrame

    Returns a new org.apache.spark.sql.DataFrame with an added "uast" column that contains the Protobuf-serialized UAST. The current DataFrame must contain the file data: path and content. If the language is available, it is used, e.g. to avoid parsing Text and Markdown.

    val uastsDf = filesDf.extractUASTs()
    returns

    new DataFrame that contains Protobuf serialized UAST.

  11. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  12. def getAllReferenceCommits: DataFrame

    Returns a new org.apache.spark.sql.DataFrame with all the commits in a reference. After calling this method, commits may appear multiple times in the DataFrame, because most commits are shared among references. Keep in mind that calling this method makes all further operations much slower, as there is much more data, especially if you query blobs, which will be repeated over and over.

    For the following example, consider a master branch with 100 commits and a "foo" branch that starts at the HEAD of master and adds two more commits. Before the call there is one row per reference (2); afterwards there are 100 rows for master plus 102 for foo (202).

    > commitsDf.count()
    2
    > commitsDf.getAllReferenceCommits.count()
    202
    returns

    dataframe with all reference commits
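
    The arithmetic behind those counts can be checked with plain Scala collections. This is only a sketch of the counting, not of the engine: the commit identifiers `c1` to `c102` and the object name `ReferenceCommitCount` are made up for the example.

```scala
object ReferenceCommitCount {
  // Hypothetical commit ids: master holds c1..c100, foo adds c101 and c102 on top.
  val master: Seq[String] = (1 to 100).map(i => s"c$i")
  val foo: Seq[String] = master ++ Seq("c101", "c102")

  // One (reference, commit) row for every commit reachable from each reference.
  val rows: Seq[(String, String)] =
    master.map("refs/heads/master" -> _) ++ foo.map("refs/heads/foo" -> _)

  // 100 rows for master plus 102 rows for foo.
  val totalRows: Int = rows.size

  // Commits shared by both references are counted once here.
  val distinctCommits: Int = rows.map(_._2).distinct.size

  def main(args: Array[String]): Unit =
    println(s"rows=$totalRows distinct=$distinctCommits") // prints "rows=202 distinct=102"
}
```

    The gap between 202 rows and 102 distinct commits is the duplication described above, and it grows with every additional reference that shares history.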

  13. def getBlobs: DataFrame

    Returns a new org.apache.spark.sql.DataFrame that results from joining the current DataFrame with the blobs DataFrame. If the current DataFrame does not contain the tree entries data, getTreeEntries is called automatically.

    val blobsDf = treeEntriesDf.getBlobs
    val blobsDf2 = commitsDf.getBlobs // can be obtained from commits too
    returns

    new DataFrame that also contains blob data.

  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def getCommits: DataFrame

    Returns a new org.apache.spark.sql.DataFrame that results from joining the current DataFrame with the commits DataFrame, returning only the last commit in each reference (i.e. the current state). The current DataFrame must have a "repository_id" column, which is the identifier of the repository.

    val commitDf = refsDf.getCommits

    You can use tech.sourced.engine.EngineDataFrame#getAllReferenceCommits to get all the commits in the references, but be aware that it is a very costly operation.

    val allCommitsDf = refsDf.getCommits.getAllReferenceCommits
    returns

    new DataFrame that also contains commits data.

  16. def getHEAD: DataFrame

    Returns a new org.apache.spark.sql.DataFrame containing only the rows with a HEAD reference.

    val headDf = refsDf.getHEAD
    returns

    new DataFrame with only the HEAD reference rows

  17. def getMaster: DataFrame

    Returns a new org.apache.spark.sql.DataFrame containing only the rows with a master reference.

    val masterDf = refsDf.getMaster
    returns

    new DataFrame with only the master reference rows

  18. def getReference(name: String): DataFrame

    Returns a new org.apache.spark.sql.DataFrame containing only the rows with a reference whose name equals the one provided.

    val developDf = refsDf.getReference("refs/heads/develop")
    name

    name of the reference to filter by

    returns

    new DataFrame with only the rows of the given reference

  19. def getReferences: DataFrame

    Returns a new org.apache.spark.sql.DataFrame that results from joining the current DataFrame with the references DataFrame. The current DataFrame must have an "id" column, which should be the repository identifier.

    val refsDf = reposDf.getReferences
    returns

    new DataFrame that also contains references data.

  20. def getRemoteReferences: DataFrame

    Returns a new org.apache.spark.sql.DataFrame with only remote references in it. If the DataFrame contains repository data, the references for those repositories are fetched automatically.

    val remoteRefs = reposDf.getRemoteReferences
    val remoteRefs2 = reposDf.getReferences.getRemoteReferences
  21. def getTreeEntries: DataFrame

    Returns a new org.apache.spark.sql.DataFrame that results from joining the current DataFrame with the tree entries DataFrame.

    val entriesDf = commitsDf.getTreeEntries
    returns

    new DataFrame that also contains tree entries data.

  22. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  23. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  24. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  25. final def notify(): Unit

    Definition Classes
    AnyRef
  26. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  27. def queryUAST(query: String, queryColumn: String = "uast", outputColumn: String = "result"): DataFrame

    Queries a list of UAST nodes with the given XPath query to get specific nodes, and puts the result in another column.

    import gopkg.in.bblfsh.sdk.v1.uast.generated.{Node, Role}
    
    // get all identifiers (default columns: reads "uast", writes "result")
    var identifiers = uastsDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]")
      .collect()
      .map(row => row(row.fieldIndex("result")))
      .flatMap(_.asInstanceOf[Seq[Array[Byte]]])
      .map(Node.from)
      .map(_.token)
    
    // get all identifiers from column "foo" and put them in "bar"
    identifiers = uastsDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]",
                                    "foo", "bar")
      .collect()
      .map(row => row(row.fieldIndex("bar"))) // read from the output column chosen above
      .flatMap(_.asInstanceOf[Seq[Array[Byte]]])
      .map(Node.from)
      .map(_.token)
    query

    XPath query

    queryColumn

    column that contains the list of UAST nodes to query

    outputColumn

    column where the result of the query will be placed

    returns

    DataFrame with the output column ("result" by default) containing the nodes

  28. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  29. def toString(): String

    Definition Classes
    AnyRef → Any
  30. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  31. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  32. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
