public class SparkDl4jMultiLayer extends Object implements Serializable
| Modifier and Type | Field and Description |
|---|---|
| `static String` | `ACCUM_GRADIENT` |
| `static String` | `AVERAGE_EACH_ITERATION` |
| `static int` | `DEFAULT_EVAL_SCORE_BATCH_SIZE` |
| `static String` | `DIVIDE_ACCUM_GRADIENT` |
| Constructor and Description |
|---|
| `SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc, MultiLayerConfiguration conf)` Training constructor. |
| `SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext, MultiLayerNetwork network)` |
| `SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerConfiguration conf)` Training constructor. |
| `SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerNetwork network)` Instantiate a multi layer spark instance with the given context and network. |
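For orientation, the training constructor might be used as in the following sketch. None of this is taken from this page: the import paths assume the usual deeplearning4j-spark package layout, the Spark settings are placeholders, and `buildConf()` is a hypothetical helper standing in for your own network configuration code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;

public class SparkDl4jSetup {
    public static void main(String[] args) {
        // Placeholder Spark configuration; use your cluster's master URL in practice
        SparkConf sparkConf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("dl4j-spark-sketch");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Hypothetical helper: build your MultiLayerConfiguration here
        MultiLayerConfiguration conf = buildConf();

        // Training constructor: wraps the network configuration for distributed fitting
        SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf);
    }

    private static MultiLayerConfiguration buildConf() {
        throw new UnsupportedOperationException("configure your network");
    }
}
```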
| Modifier and Type | Method and Description |
|---|---|
| `double` | `calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average)` |
| `Evaluation` | `evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)` Evaluate the network (classification performance) in a distributed manner on the provided data. |
| `Evaluation` | `evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, List<String> labelsList)` Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels. |
| `Evaluation` | `evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, List<String> labelsList, int evalBatchSize)` Evaluate the network (classification performance) in a distributed manner, using a specified batch size and a provided list of labels. |
| `MultiLayerNetwork` | `fit(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd, int batchSize)` Fit the given rdd given the context. |
| `MultiLayerNetwork` | `fit(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)` Fit the given rdd given the context. |
| `MultiLayerNetwork` | `fit(String path, int labelIndex, org.canova.api.records.reader.RecordReader recordReader)` Train a multi layer network based on data loaded from a text file + RecordReader. |
| `MultiLayerNetwork` | `fit(String path, int labelIndex, org.canova.api.records.reader.RecordReader recordReader, int examplesPerFit, int numPartitions)` Train a multi layer network based on data loaded from a text file + RecordReader. |
| `MultiLayerNetwork` | `fit(String path, int labelIndex, org.canova.api.records.reader.RecordReader recordReader, int examplesPerFit, int totalExamples, int numPartitions)` Train a multi layer network based on data loaded from a text file + RecordReader. |
| `MultiLayerNetwork` | `fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd)` Fit the dataset rdd. |
| `MultiLayerNetwork` | `fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd, int examplesPerFit, int numPartitions)` Equivalent to `fitDataSet(JavaRDD, int, int, int)`, but persists and counts the size of the data set first, instead of requiring the data set size to be provided externally. |
| `MultiLayerNetwork` | `fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd, int examplesPerFit, int totalExamples, int numPartitions)` Fit the data, splitting into smaller data subsets if necessary. |
| `MultiLayerNetwork` | `getNetwork()` |
| `double` | `getScore()` Gets the last (average) minibatch score from calling fit. |
| `protected void` | `invokeListeners(MultiLayerNetwork network, int iteration)` |
| `org.apache.spark.mllib.linalg.Matrix` | `predict(org.apache.spark.mllib.linalg.Matrix features)` Predict the given feature matrix. |
| `org.apache.spark.mllib.linalg.Vector` | `predict(org.apache.spark.mllib.linalg.Vector point)` Predict the given vector. |
| `protected void` | `runIteration(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd)` |
| `<K> org.apache.spark.api.java.JavaPairRDD<K,Double>` | `scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)` Score the examples individually, using the default batch size `DEFAULT_EVAL_SCORE_BATCH_SIZE`. |
| `<K> org.apache.spark.api.java.JavaPairRDD<K,Double>` | `scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)` Score the examples individually, using a specified batch size. |
| `org.apache.spark.api.java.JavaDoubleRDD` | `scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)` Score the examples individually, using the default batch size `DEFAULT_EVAL_SCORE_BATCH_SIZE`. |
| `org.apache.spark.api.java.JavaDoubleRDD` | `scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)` Score the examples individually, using a specified batch size. |
| `void` | `setListeners(Collection<IterationListener> listeners)` This method allows you to specify IterationListeners for this model. |
| `void` | `setNetwork(MultiLayerNetwork network)` |
| `static MultiLayerNetwork` | `train(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> data, MultiLayerConfiguration conf)` Train a multi layer network. |
public static final int DEFAULT_EVAL_SCORE_BATCH_SIZE
public static final String AVERAGE_EACH_ITERATION
public static final String ACCUM_GRADIENT
public static final String DIVIDE_ACCUM_GRADIENT
public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext,
                           MultiLayerNetwork network)
Instantiate a multi layer spark instance with the given context and network.
Parameters:
sparkContext - the spark context to use
network - the network to use

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext,
                           MultiLayerNetwork network)

public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext,
                           MultiLayerConfiguration conf)
Training constructor.
Parameters:
sparkContext - the spark context to use
conf - the configuration of the network

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc,
                           MultiLayerConfiguration conf)
Training constructor.
Parameters:
sc - the spark context to use
conf - the configuration of the network

public MultiLayerNetwork fit(String path,
                             int labelIndex,
                             org.canova.api.records.reader.RecordReader recordReader)
Train a multi layer network based on data loaded from a text file + RecordReader. This method splits the entire data set at once.
Parameters:
path - the path to the text file
labelIndex - the label index
recordReader - the record reader to parse results
Returns:
MultiLayerNetwork
See Also:
fit(String, int, RecordReader, int, int, int)

public MultiLayerNetwork fit(String path,
                             int labelIndex,
                             org.canova.api.records.reader.RecordReader recordReader,
                             int examplesPerFit,
                             int numPartitions)
Train a multi layer network based on data loaded from a text file + RecordReader. This method splits the data into approximately examplesPerFit sized splits, and trains on each split one after the other. See fitDataSet(JavaRDD, int, int, int) for further details. Unlike fit(String, int, RecordReader, int, int, int), this method persists and then counts the data set size directly. This is usually OK, though if the data set does not fit in memory, this can result in some overhead due to the data being loaded multiple times (once for the count, once for fitting), as compared to providing the data set size to the fit(String, int, RecordReader, int, int, int) method.
Parameters:
path - the path to the text file
labelIndex - the label index
recordReader - the record reader to parse results
examplesPerFit - Number of examples to fit on at each iteration
numPartitions - Number of partitions to divide each subset of the data into (for best results, this should be equal to the number of executors)
Returns:
MultiLayerNetwork

public MultiLayerNetwork fit(String path,
                             int labelIndex,
                             org.canova.api.records.reader.RecordReader recordReader,
                             int examplesPerFit,
                             int totalExamples,
                             int numPartitions)
Train a multi layer network based on data loaded from a text file + RecordReader. This method splits the data into approximately examplesPerFit sized splits, and trains on each split one after the other. See fitDataSet(JavaRDD, int, int, int) for further details.
Parameters:
path - the path to the text file
labelIndex - the label index
recordReader - the record reader to parse results
examplesPerFit - Number of examples to fit on at each iteration (divided between all executors)
totalExamples - total number of examples in the data RDD
numPartitions - Number of partitions to divide each subset of the data into (for best results, this should be equal to the number of executors)
Returns:
MultiLayerNetwork
See Also:
fit(String, int, RecordReader, int, int)

public MultiLayerNetwork getNetwork()
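The text-file fit overloads might be called as in the following sketch. The use of Canova's CSVRecordReader, the file path, and all numeric values are illustrative assumptions (not prescribed by this page); `sparkNet` is assumed to be a SparkDl4jMultiLayer built with one of the training constructors.

```java
import org.canova.api.records.reader.RecordReader;
import org.canova.api.records.reader.impl.CSVRecordReader;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

// 'sparkNet' is assumed to be an existing SparkDl4jMultiLayer instance.
RecordReader recordReader = new CSVRecordReader();  // parses each text line into a record
String path = "hdfs:///data/train.csv";             // placeholder data location
int labelIndex = 4;                                 // placeholder: column holding the label

// Train in ~1000-example chunks, averaging parameters between chunks;
// 8 partitions per chunk (for best results, equal to the number of executors)
MultiLayerNetwork trained = sparkNet.fit(path, labelIndex, recordReader, 1000, 8);
```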
public void setNetwork(MultiLayerNetwork network)
public org.apache.spark.mllib.linalg.Matrix predict(org.apache.spark.mllib.linalg.Matrix features)
Predict the given feature matrix.
Parameters:
features - the given feature matrix

public org.apache.spark.mllib.linalg.Vector predict(org.apache.spark.mllib.linalg.Vector point)
Predict the given vector.
Parameters:
point - the vector to predict

public MultiLayerNetwork fit(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd,
                             int batchSize)
Fit the given rdd given the context.
Parameters:
rdd - the rdd to fitDataSet

public MultiLayerNetwork fit(org.apache.spark.api.java.JavaSparkContext sc,
                             org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)
Fit the given rdd given the context.
Parameters:
sc - the org.deeplearning4j.spark context
rdd - the rdd to fitDataSet

public MultiLayerNetwork fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd,
                                    int examplesPerFit,
                                    int numPartitions)
Equivalent to fitDataSet(JavaRDD, int, int, int), but persists and counts the size of the data set first, instead of requiring the data set size to be provided externally.
Note: In some cases, it may be more efficient to count the size of the data set earlier in the pipeline and provide this count to the fitDataSet(JavaRDD, int, int, int) method, as counting on the JavaRDD<DataSet> requires a full pass of the data pipeline. In cases where the entire JavaRDD<DataSet> does not fit in memory, this approach can result in multiple passes being done over the data, potentially degrading performance.
Parameters:
rdd - Data to train on
examplesPerFit - Number of examples to learn on (between averaging) across all executors. For example, if set to 1000 and rdd.count() == 10k, then we do 10 sets of learning, each on 1000 examples. To use all examples, set maxExamplesPerFit to Integer.MAX_VALUE
numPartitions - number of partitions to divide the data into. For best results, this should be equal to the number of executors

public MultiLayerNetwork fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd,
                                    int examplesPerFit,
                                    int totalExamples,
                                    int numPartitions)
Fit the data, splitting into smaller data subsets if necessary. This allows large JavaRDD<DataSet>s to be trained as a set of smaller steps instead of all together: train on examplesPerFit examples -> average parameters -> train on examplesPerFit -> average parameters etc until the entire data set has been processed. For example, suppose examplesPerFit=1000, with rdd.count()=1200. Then, we round up to 2000 examples, and the network will then be fit in two steps (as 2000/1000=2), with 1200/2=600 examples at each step. These 600 examples will then be distributed approximately equally (no guarantees) amongst each executor/core for training.
Parameters:
rdd - Data to train on
examplesPerFit - Number of examples to learn on (between averaging) across all executors. For example, if set to 1000 and rdd.count() == 10k, then we do 10 sets of learning, each on 1000 examples. To use all examples, set maxExamplesPerFit to Integer.MAX_VALUE
totalExamples - total number of examples in the data RDD
numPartitions - number of partitions to divide the data into. For best results, this should be equal to the number of executors

public MultiLayerNetwork fitDataSet(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd)
Fit the dataset rdd.
Parameters:
rdd - the rdd to fitDataSet

protected void runIteration(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd)
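The two sized fitDataSet variants can be sketched as follows. The RDD construction is assumed to happen elsewhere (`loadTrainingData()` is a hypothetical helper), and the example counts are illustrative.

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.DataSet;

// 'sparkNet' is assumed to be an existing SparkDl4jMultiLayer instance.
JavaRDD<DataSet> trainData = loadTrainingData();  // hypothetical helper

// Variant 1: let the method persist and count the RDD itself
MultiLayerNetwork net1 = sparkNet.fitDataSet(trainData, 1000, 8);

// Variant 2: supply the total example count up front to avoid the extra
// counting pass over the data pipeline (10000 here is illustrative)
MultiLayerNetwork net2 = sparkNet.fitDataSet(trainData, 1000, 10000, 8);
```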
public static MultiLayerNetwork train(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> data,
                                      MultiLayerConfiguration conf)
Train a multi layer network.
Parameters:
data - the data to train on
conf - the configuration of the network

public void setListeners(@NonNull Collection<IterationListener> listeners)
This method allows you to specify IterationListeners for this model.
Parameters:
listeners -

protected void invokeListeners(MultiLayerNetwork network, int iteration)
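A sketch of attaching a listener before fitting. ScoreIterationListener is a standard DL4J listener, though its use here (and its import path) is an assumption rather than something this page prescribes.

```java
import java.util.Collection;
import java.util.Collections;
import org.deeplearning4j.optimize.api.IterationListener;
import org.deeplearning4j.optimize.listeners.ScoreIterationListener;

// Log the training score every 10 iterations; invokeListeners(...) fires these
// during distributed fitting. 'sparkNet' is assumed to exist.
Collection<IterationListener> listeners =
        Collections.<IterationListener>singletonList(new ScoreIterationListener(10));
sparkNet.setListeners(listeners);
```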
public double getScore()
public double calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data,
boolean average)
public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data,
                                                             boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples, use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int), which can have a key for each example.
Parameters:
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
See Also:
MultiLayerNetwork.scoreExamples(DataSet, boolean)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data,
                                                             boolean includeRegularizationTerms,
                                                             int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples, use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int), which can have a key for each example.
Parameters:
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
batchSize - Batch size to use when doing scoring
See Also:
MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data,
                                                                         boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately.
Type Parameters:
K - Key type
Parameters:
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
Returns:
JavaPairRDD<K,Double> containing the scores of each example
See Also:
MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data,
                                                                         boolean includeRegularizationTerms,
                                                                         int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately.
Type Parameters:
K - Key type
Parameters:
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
batchSize - Batch size to use when doing scoring
Returns:
JavaPairRDD<K,Double> containing the scores of each example
See Also:
MultiLayerNetwork.scoreExamples(DataSet, boolean)

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Evaluate the network (classification performance) in a distributed manner on the provided data.
Parameters:
data - Data to evaluate on

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data,
                           List<String> labelsList)
Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels.
Parameters:
data - Data to evaluate on
labelsList - List of labels used for evaluation

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data,
                           List<String> labelsList,
                           int evalBatchSize)
Evaluate the network (classification performance) in a distributed manner, using a specified batch size and a provided list of labels.
Parameters:
data - Data to evaluate on
labelsList - List of labels used for evaluation
evalBatchSize - Batch size to use when conducting evaluations

Copyright © 2016. All Rights Reserved.
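Putting the scoring and evaluation methods together, a post-training sketch might look like the following. The test RDD, label names, and import paths are assumptions following the usual DL4J layout; `sparkNet` and `testData` are assumed to exist already.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.eval.Evaluation;
import org.nd4j.linalg.dataset.DataSet;

// 'sparkNet' and 'testData' (a JavaRDD<DataSet>) are assumed to exist.
List<String> labels = Arrays.asList("negative", "positive");  // illustrative label names

// Distributed classification evaluation with an explicit label list
Evaluation eval = sparkNet.evaluate(testData, labels);
System.out.println(eval.stats());  // accuracy/precision/recall summary

// Per-example scores (including any l1/l2 regularization terms), default batch size
JavaDoubleRDD exampleScores = sparkNet.scoreExamples(testData, true);
System.out.println("Mean example score: " + exampleScores.mean());
```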