Returns a new org.apache.spark.sql.DataFrame with a new column "lang" added, containing the language of the non-binary files. It requires the current dataframe to contain files data.
val languagesDf = filesDf.classifyLanguages
new DataFrame that also contains language data.
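For instance, the "lang" column added by classifyLanguages can then be used to filter files by language. A minimal sketch, assuming filesDf already contains files data (path and content), e.g. obtained through getBlobs:

```scala
// Classify languages, then keep only Scala sources.
// The "lang" column is added by classifyLanguages; "path" is
// assumed to be present in the files data.
val languagesDf = filesDf.classifyLanguages

val scalaFiles = languagesDf
  .filter(languagesDf("lang") === "Scala")
  .select("path", "lang")
```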
Extracts the tokens in all nodes of a given column and puts the list of retrieved tokens in a new column.
val tokensDf = uastDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]")
  .extractTokens()
column containing the UAST nodes.
column where the list of retrieved tokens will be placed.
new DataFrame with the extracted tokens
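Both columns can also be set explicitly. A sketch using custom column names, assuming "uast" is the column holding the serialized UAST nodes (as produced by extractUASTs); the names "foo" and "bar" are purely illustrative:

```scala
// Query the UAST nodes in column "uast", put the matching nodes in
// "foo", then extract their tokens into "bar".
val tokensDf = uastDf
  .queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]", "uast", "foo")
  .extractTokens("foo", "bar")

tokensDf.select("path", "bar").show()
```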
Returns a new org.apache.spark.sql.DataFrame with a new column "uast" added, that contains the Protobuf serialized UAST. It requires the current dataframe to have files data: path and content. If the language is available, it is used, e.g. to avoid parsing Text and Markdown files.
val uastsDf = filesDf.extractUASTs
new DataFrame that contains the Protobuf serialized UAST.
Returns a new org.apache.spark.sql.DataFrame with all the commits in a reference. After calling this method, commits may appear multiple times in your DataFrame, because most commits are shared amongst references. Take into account that calling this method will make all further operations significantly slower, as there is much more data, especially if you query blobs, which will be repeated over and over.
For the next example, consider we have a master branch with 100 commits and a "foo" branch whose parent is the HEAD of master and has two more commits.
> commitsDf.count()
2
> commitsDf.getAllReferenceCommits.count()
202
dataframe with all reference commits
Returns a new org.apache.spark.sql.DataFrame with the product of joining the current dataframe with the blobs dataframe. If the current dataframe does not contain the tree entries data, getTreeEntries will be called automatically.
val blobsDf = treeEntriesDf.getBlobs
val blobsDf2 = commitsDf.getBlobs // can be obtained from commits too
new DataFrame that also contains blob data.
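A common follow-up is to discard binary blobs before processing content. A sketch, assuming the blobs dataframe exposes an "is_binary" column as in the engine's blobs schema:

```scala
// Join commits with their blobs, then keep only text blobs
// before classifying languages.
val blobsDf = commitsDf.getBlobs

val textBlobs = blobsDf.filter(blobsDf("is_binary") === false)
val classified = textBlobs.classifyLanguages
```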
Returns a new org.apache.spark.sql.DataFrame with the product of joining the current dataframe with the commits dataframe, returning only the last commit in a reference (aka the current state). It requires the current dataframe to have a "repository_id" column, which is the identifier of the repository.
val commitDf = refsDf.getCommits
You can use tech.sourced.engine.EngineDataFrame#getAllReferenceCommits to get all the commits in the references, but do so knowing that it is a very costly operation.
val allCommitsDf = refsDf.getCommits.getAllReferenceCommits
new DataFrame that also contains commits data.
Returns a new org.apache.spark.sql.DataFrame containing only the rows with a HEAD reference.
val headDf = refsDf.getHEAD
new dataframe with only HEAD reference rows
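getHEAD composes naturally with the other helpers. A sketch that gets the current state of every repository, i.e. the tip commit of each HEAD reference:

```scala
// From repositories, join references, keep only HEAD, then get the
// last commit of each reference (the default behaviour of getCommits).
val headCommitsDf = reposDf
  .getReferences
  .getHEAD
  .getCommits
```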
Returns a new org.apache.spark.sql.DataFrame containing only the rows with a master reference.
val masterDf = refsDf.getMaster
new dataframe with only the master reference rows
Returns a new org.apache.spark.sql.DataFrame containing only the rows with a reference whose name equals the one provided.
val developDf = refsDf.getReference("refs/heads/develop")
name of the reference to filter by
new dataframe with only the given reference rows
Returns a new org.apache.spark.sql.DataFrame with the product of joining the current dataframe with the references dataframe. It requires the dataframe to have an "id" column, which should be the repository identifier.
val refsDf = reposDf.getReferences
new DataFrame that also contains references data.
Returns a new org.apache.spark.sql.DataFrame with only remote references in it. If the DataFrame contains repository data it will automatically get the references for those repositories.
val remoteRefs = reposDf.getRemoteReferences
val remoteRefs2 = reposDf.getReferences.getRemoteReferences
Returns a new org.apache.spark.sql.DataFrame with the product of joining the current dataframe with the tree entries dataframe.
val entriesDf = commitsDf.getTreeEntries
new DataFrame that also contains tree entries data.
Queries a list of UAST nodes with the given query to get specific nodes, and puts the result in another column.
import gopkg.in.bblfsh.sdk.v1.uast.generated.{Node, Role}

// get all identifiers
var identifiers = uastsDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]")
  .collect()
  .map(row => row(row.fieldIndex("result")))
  .flatMap(_.asInstanceOf[Seq[Array[Byte]]])
  .map(Node.from)
  .map(_.token)

// get all identifiers from column "foo" and put them in "bar"
identifiers = uastsDf.queryUAST("//*[@roleIdentifier and not(@roleIncomplete)]", "foo", "bar")
  .collect()
  .map(row => row(row.fieldIndex("bar")))
  .flatMap(_.asInstanceOf[Seq[Array[Byte]]])
  .map(Node.from)
  .map(_.token)
XPath query
column containing the list of UAST nodes to query
column where the result of the query will be placed
DataFrame with "result" column containing the nodes
Adds some utility methods to the org.apache.spark.sql.DataFrame class so you can, for example, get the references, commits, etc from a data frame containing repositories.
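Putting these helpers together, a typical end-to-end pipeline might look like the sketch below. The Engine entry point and the repositories path are illustrative assumptions; the exact constructor signature may differ between engine versions:

```scala
import org.apache.spark.sql.SparkSession
import tech.sourced.engine._ // brings the DataFrame helpers into scope

// Assumed entry point: an Engine built from a SparkSession and a path
// to the repositories (signature may vary across versions).
val spark = SparkSession.builder().appName("engine-example").getOrCreate()
val engine = Engine(spark, "/path/to/repositories")

val langsDf = engine
  .getRepositories   // one row per repository
  .getReferences     // join with references data
  .getHEAD           // keep only HEAD references
  .getCommits        // last commit of each reference
  .getBlobs          // join with blob data
  .classifyLanguages // add the "lang" column

langsDf.groupBy("lang").count().show()
```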