Run on Google Cloud Dataproc

Before You Start

Before using BigDL on Google Dataproc, you need setup a project and create a cluster on Dataproc(you may refer to https://cloud.google.com/sdk/docs/how-to for more instructions).

Please disable spark.dynamicAllocation when creating cluster. E.g.,

gcloud dataproc clusters create bigdl --project $PROJECT_NAME --worker-machine-type $MACHINETYPE --master-machine-type $MACHINETYPE --num-workers $WORKERNUMBER --properties spark:spark.dynamicAllocation.enabled=false

Or please set --conf "spark.dynamicAllocation.enabled=false" when submitting spark jobs

Download BigDL

BigDL can be downloaded from https://bigdl-project.github.io/master/#release-download. It provides prebuild binary for different Spark versions. Please ssh on the master VM Instance and download BigDL according to Spark version.

wget BIGDL_LINK

After downloading BigDL, you will be able to find a zip file,eg.dist-spark-2.2.0-scala-2.11.8-0.4.0-dist.zip. Unzip the file

unzip xxx.zip

Run BigDL Scala Examples

Now you can run BigDL examples on Google Dataproc. For instance, you may use the run.example.sh script which is located under ./bin directory with following parameters:

Mandatory parameters:
- -m|--model which model to train, including
  - lenet: train the LeNet example
  - vgg: train the VGG example
  - inception-v1: train the Inception v1 example
  - perf: test the training speed using the Inception v1 model with dummy data
- -s|--spark-url the master URL for the Spark cluster
- -n|--nodes number of Spark slave nodes
- -o|--cores number of cores used on each node
- -r|--memory memory used on each node, e.g. 200g
- -b|--batch-size batch size when training the model; it is expected to be a multiple of "nodes * cores"
- -f|--hdfs-data-dir HDFS directory for the input images (for the "inception-v1" model training only)
Optional parameters:
- -e|--max-epoch the maximum number of epochs (i.e., going through all the input data once) used in the training; default to 90 if not specified
- -p|--spark by default the example will run with Spark 1.5 or 1.6; to use Spark 2.0, please specify "spark_2.0" here (it is highly recommended to use Java 8 when running BigDL for Spark 2.0, otherwise you may observe very poor performance)
- -l|--learning-rate by default the the example will use an initial learning rate of "0.01"; you can specify a different value here

After the training, you can check the log files and generated models

Replace $BIGDLJAR with bigdl binary name in ./lib in below command, eg: bigdl-SPARK_2.2-0.4.0-jar-with-dependencies.jar

./bin/run.example.sh --model lenet --nodes 2 --cores 2 --memory 1g --batch-size 16 -j lib/$BIGDLJAR -p spark_buildIn

You can also run lenet examples in below command. Before submit below command, please make sure you have already downloaded mnist and put it under mnist directory, more detail see https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/models/lenet:

spark-submit \
 --executor-cores 2 \
 --num-executors 2 \
 --driver-class-path ./lib/$BIGDLJAR \
 --class com.intel.analytics.bigdl.models.lenet.Train \
 ./lib/$BIGDLJAR \
 -f ./mnist \
 -b 16

Run BigDL Python example

Download lenet5.py from https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/models/lenet/lenet5.py

wget https://raw.githubusercontent.com/intel-analytics/BigDL/master/pyspark/bigdl/models/lenet/lenet5.py

Replace $BIGDLJAR with bigdl binary name in ./lib, eg: bigdl-SPARK_2.2-0.4.0-jar-with-dependencies.jar
Replace $BIGDL_PYTHON_ZIP with bigdl python binary name in ./lib, eg: bigdl-0.4.0-python-api.zip

PYTHON_API_ZIP_PATH=./lib/$BIGDL_PYTHON_ZIP
BigDL_JAR_PATH=./lib/$BIGDLJAR
PYTHONPATH=${PYTHON_API_ZIP_PATH}:$PYTHONPATH
spark-submit \
        --driver-cores 2  \
        --driver-memory 2g  \
        --num-executors 2  \
        --executor-cores 2  \
        --executor-memory 4g \
        --py-files ${PYTHON_API_ZIP_PATH},./lenet5.py  \
        --properties-file ./conf/spark-bigdl.conf \
        --jars ${BigDL_JAR_PATH} \
        --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \
        --conf spark.executor.extraClassPath=${BigDL_JAR_PATH} \
        ./lenet5.py \
        --action train