package python
Type Members
-
class
BatchProducer extends AutoCloseable
It accepts an iterator as input and caches the batches as they are pulled from the input, so that the reader can later combine them with the batches coming back from Python. It also supports an optional converter that converts each input batch and puts the converted result into the cache queue. This is used by GpuAggregateInPandas to build and cache key batches.
Call "getBatchQueue" to get the internal cache queue and pass it to the output combination iterator. To access the batches from the input, call "asIterator" to get the output iterator.
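The caching and later combination described above can be sketched roughly as follows. This is a simplified illustration, not the plugin's implementation: CachingProducer, cacheQueue and combine are hypothetical names, a generic type T stands in for ColumnarBatch, and resource handling (AutoCloseable) is left out.

```scala
import scala.collection.mutable

object BatchProducerSketch {
  // Hypothetical stand-in for BatchProducer, generic over the batch type T.
  class CachingProducer[T](input: Iterator[T], converter: Option[T => T] = None) {
    private val queue = mutable.Queue.empty[T]

    // Counterpart of "getBatchQueue": the cache that the combining side reads from.
    def cacheQueue: mutable.Queue[T] = queue

    // Counterpart of "asIterator": yields the input items and caches each one
    // (converted, if a converter was given) at the moment it is pulled.
    def asIterator: Iterator[T] = new Iterator[T] {
      def hasNext: Boolean = input.hasNext
      def next(): T = {
        val item = input.next()
        queue.enqueue(converter.map(f => f(item)).getOrElse(item))
        item
      }
    }
  }

  // Hypothetical combining step: pairs each result with the cached input,
  // in the order in which the inputs were pulled.
  def combine[T, R](cache: mutable.Queue[T], results: Iterator[R]): Iterator[(T, R)] =
    results.map(r => (cache.dequeue(), r))
}

object BatchProducerSketchDemo {
  def main(args: Array[String]): Unit = {
    val producer = new BatchProducerSketch.CachingProducer[Int](Iterator(1, 2, 3))
    // The map call stands in for the round trip through the Python worker.
    val pythonOut = producer.asIterator.map(_ * 2)
    println(BatchProducerSketch.combine(producer.cacheQueue, pythonOut).toList)
  }
}
```

Because each result is only produced after its input has been pulled and cached, the queue always holds the matching input when the combining side dequeues it.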
-
trait
BatchQueue extends AnyRef
A trait that provides dedicated APIs for GPU reading of batches from Python. It also simplifies type declarations, since it is implemented by an inner class of BatchProducer.
-
class
CoGroupedIterator extends Iterator[(ColumnarBatch, ColumnarBatch)]
Iterates over the left and right BatchGroupedIterators and returns the cogrouped data, i.e. each record consists of the rows that share the same grouping key in the two BatchGroupedIterators.
Note: we assume the output of each BatchGroupedIterator is ordered by the grouping key.
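The ordering assumption is what allows a single merge-style pass over both sides. A minimal sketch of that idea follows, using plain Scala collections instead of ColumnarBatch. The coGroup function is hypothetical, and pairing a key that appears on only one side with an empty group is an assumption made for this illustration.

```scala
object CoGroupSketch {
  // Merges two streams of (key, group) pairs that are each ordered by key.
  // A key missing on one side is paired with an empty group (an assumption).
  def coGroup[K, V](left: Iterator[(K, Seq[V])], right: Iterator[(K, Seq[V])])
                   (implicit ord: Ordering[K]): Iterator[(K, Seq[V], Seq[V])] = {
    val l = left.buffered
    val r = right.buffered
    new Iterator[(K, Seq[V], Seq[V])] {
      private def leftOnly(): (K, Seq[V], Seq[V]) = {
        val (k, g) = l.next()
        (k, g, Seq.empty[V])
      }
      private def rightOnly(): (K, Seq[V], Seq[V]) = {
        val (k, g) = r.next()
        (k, Seq.empty[V], g)
      }

      def hasNext: Boolean = l.hasNext || r.hasNext

      def next(): (K, Seq[V], Seq[V]) = {
        if (!r.hasNext) leftOnly()
        else if (!l.hasNext) rightOnly()
        else {
          // Both sides have a pending group: emit the smaller key first,
          // or both groups together when the keys are equal.
          val cmp = ord.compare(l.head._1, r.head._1)
          if (cmp == 0) {
            val (k, lg) = l.next()
            (k, lg, r.next()._2)
          } else if (cmp < 0) leftOnly()
          else rightOnly()
        }
      }
    }
  }
}
```

If either input were not ordered by the grouping key, equal keys could be missed by the comparison and emitted as separate records, which is why the note above matters.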
-
class
CombiningIterator extends Iterator[ColumnarBatch]
An iterator that combines the batches in inputBatchQueue with the result batches in pythonOutputIter, one by one.
Both the batches from inputBatchQueue and pythonOutputIter should have the same number of rows.
In each batch returned by next, the columns of the result batch are appended to the columns of the input batch.
-
case class
GpuAggregateInPandasExec(gpuGroupingExpressions: Seq[NamedExpression], udfExpressions: Seq[GpuPythonFunction], pyOutAttributes: Seq[Attribute], resultExpressions: Seq[NamedExpression], child: SparkPlan)(cpuGroupingExpressions: Seq[NamedExpression]) extends SparkPlan with ShimUnaryExecNode with GpuPythonExecBase with Product with Serializable
Physical node for aggregation with group aggregate Pandas UDF.
This plan works by sending the necessary (projected) input grouped data as Arrow record batches to the Python worker; the Python worker invokes the UDF and sends the results back to the executor. Finally, the executor evaluates any post-aggregation expressions and joins the result with the grouped key.
This node aims to accelerate the data transfer between the JVM and Python for the GPU pipeline, and to schedule GPU resources for its Python processes.
-
case class
GpuArrowEvalPythonExec(udfs: Seq[GpuPythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with ShimUnaryExecNode with GpuPythonExecBase with Product with Serializable
A physical plan that evaluates a GpuPythonUDF. The transformation of the data to Arrow happens on the GPU (practically a no-op), but the UDFs are executed on the CPU.
- trait GpuArrowOutput extends AnyRef
- abstract class GpuArrowPythonWriter extends GpuArrowWriter
- trait GpuArrowWriter extends AutoCloseable
-
case class
GpuFlatMapCoGroupsInPandasExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], udf: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with ShimBinaryExecNode with GpuPythonExecBase with Product with Serializable
GPU version of Spark's FlatMapCoGroupsInPandasExec.
This node aims to accelerate the data transfer between the JVM and Python for the GPU pipeline, and to schedule GPU resources for its Python processes.
- class GpuFlatMapCoGroupsInPandasExecMeta extends SparkPlanMeta[FlatMapCoGroupsInPandasExec]
-
case class
GpuFlatMapGroupsInPandasExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with ShimUnaryExecNode with GpuPythonExecBase with Product with Serializable
GPU version of Spark's FlatMapGroupsInPandasExec.
Rows in each group are passed to the Python worker as an Arrow record batch. The Python worker turns the record batch into a pandas.DataFrame, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into a ColumnarBatch.
This node aims to accelerate the data transfer between the JVM and Python for the GPU pipeline, and to schedule GPU resources for its Python processes.
- class GpuFlatMapGroupsInPandasExecMeta extends SparkPlanMeta[FlatMapGroupsInPandasExec]
- trait GpuMapInBatchExec extends SparkPlan with ShimUnaryExecNode with GpuPythonExecBase
- case class GpuMapInPandasExec(func: Expression, output: Seq[Attribute], child: SparkPlan, isBarrier: Boolean) extends SparkPlan with GpuMapInBatchExec with Product with Serializable
- class GpuMapInPandasExecMetaBase extends SparkPlanMeta[MapInPandasExec]
- trait GpuPythonExecBase extends SparkPlan with GpuExec
-
abstract
class
GpuPythonFunction extends Expression with GpuUnevaluable with NonSQLExpression with UserDefinedExpression with GpuAggregateWindowFunction with Serializable
A serialized version of a Python lambda function. This is a special expression, which needs a dedicated physical operator to execute it, and thus can't be pushed down to data sources.
-
trait
GpuPythonRunnerCommon extends AnyRef
A trait that holds common pieces taken from Spark for the basic GPU Arrow Python runners.
- case class GpuPythonUDAF(name: String, func: PythonFunction, dataType: DataType, children: Seq[Expression], evalType: Int, udfDeterministic: Boolean, resultId: ExprId = NamedExpression.newExprId) extends GpuPythonFunction with GpuAggregateFunction with Product with Serializable
- case class GpuPythonUDF(name: String, func: PythonFunction, dataType: DataType, children: Seq[Expression], evalType: Int, udfDeterministic: Boolean, resultId: ExprId = NamedExpression.newExprId) extends GpuPythonFunction with Product with Serializable
- trait GpuWindowInPandasExecBase extends SparkPlan with ShimUnaryExecNode with GpuPythonExecBase
- abstract class GpuWindowInPandasExecMetaBase extends SparkPlanMeta[WindowInPandasExec]
-
case class
GroupArgs(dedupAttrs: Seq[Attribute], argOffsets: Array[Int], groupingOffsets: Seq[Int]) extends Product with Serializable
A helper class to pack the group related items for the Python input.
- dedupAttrs
the deduplicated attributes for the output of a Spark plan.
- argOffsets
the argument offsets which will be used to distinguish grouping columns and data columns by the Python workers.
- groupingOffsets
the grouping offsets (i.e. column indices) in the deduplicated attributes.
-
class
GroupingIterator extends Iterator[ColumnarBatch]
This iterator groups the rows in the incoming batches according to the window "partitionBy" specification, so that each group goes into exactly one batch and each batch contains the data of exactly one group.
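The regrouping can be pictured with the following sketch, which works on plain (key, value) rows rather than on GPU batches. The regroup function is hypothetical, and the sketch assumes that rows with the same key arrive next to each other (for example, because the input is sorted by the partition key); the source does not state how the real iterator detects group boundaries.

```scala
object GroupingSketch {
  // Re-slices incoming row batches so that every emitted batch holds exactly
  // one group. Assumes rows with the same key are adjacent in the input.
  def regroup[K, V](input: Iterator[Vector[(K, V)]]): Iterator[Vector[(K, V)]] = {
    val rows = input.flatMap(_.iterator).buffered
    new Iterator[Vector[(K, V)]] {
      def hasNext: Boolean = rows.hasNext

      def next(): Vector[(K, V)] = {
        val key = rows.head._1
        val group = Vector.newBuilder[(K, V)]
        // Consume rows until the key changes, even across input batch boundaries.
        while (rows.hasNext && rows.head._1 == key) group += rows.next()
        group.result()
      }
    }
  }
}
```

In this sketch a group that spans two input batches is stitched back together, and an input batch holding several groups is split.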
-
class
RebatchingRoundoffIterator extends Iterator[ColumnarBatch]
This iterator will round incoming batches to multiples of targetRoundoff rows, if possible. The last batch might not be a multiple of it.
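One way to picture the rounding is the sketch below, which re-slices plain row vectors instead of GPU batches. The rebatch function and its carry-over strategy are assumptions made for illustration, not the plugin's implementation; the sketch also assumes the incoming batches are non-empty.

```scala
object RoundoffSketch {
  // Re-slices incoming row batches so that each emitted batch has a multiple
  // of roundoff rows; leftover rows are carried over into the next batch.
  def rebatch[T](input: Iterator[Vector[T]], roundoff: Int): Iterator[Vector[T]] = {
    require(roundoff > 0, "roundoff must be positive")
    new Iterator[Vector[T]] {
      private var pending = Vector.empty[T]

      def hasNext: Boolean = pending.nonEmpty || input.hasNext

      def next(): Vector[T] = {
        if (!hasNext) throw new NoSuchElementException("no more batches")
        // Pull input until at least one full multiple is available or the input ends.
        while (pending.length < roundoff && input.hasNext) pending ++= input.next()
        val take =
          if (pending.length >= roundoff) pending.length / roundoff * roundoff
          else pending.length // final batch, possibly not a multiple
        val (out, rest) = pending.splitAt(take)
        pending = rest
        out
      }
    }
  }
}
```

With a roundoff of 4, input batches of 3, 4 and 1 rows come out as two batches of 4 rows each; with a roundoff of 2, a single batch of 5 rows comes out as a batch of 4 followed by a final batch of 1.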
Value Members
- object GpuAggregateInPandasExec extends Serializable
- object GpuArrowWriter
- object GpuPythonHelper extends Logging
-
object
GpuPythonUDF extends Serializable
Helper functions for GpuPythonUDF