Run on Google Cloud Dataproc


Before You Start

Before using BigDL on Google Dataproc, you need setup a project and create a cluster on Dataproc(you may refer to https://cloud.google.com/sdk/docs/how-to for more instructions).

Please disable spark.dynamicAllocation when creating cluster. E.g.,

gcloud dataproc clusters create bigdl --project $PROJECT_NAME --worker-machine-type $MACHINETYPE --master-machine-type $MACHINETYPE --num-workers $WORKERNUMBER --properties spark:spark.dynamicAllocation.enabled=false

Or please set --conf "spark.dynamicAllocation.enabled=false" when submitting spark jobs


Download BigDL

BigDL can be downloaded from https://bigdl-project.github.io/master/#release-download. It provides prebuild binary for different Spark versions. Please ssh on the master VM Instance and download BigDL according to Spark version.

wget BIGDL_LINK

After downloading BigDL, you will be able to find a zip file,eg.dist-spark-2.2.0-scala-2.11.8-0.4.0-dist.zip. Unzip the file

unzip xxx.zip

Run BigDL Scala Examples

Now you can run BigDL examples on Google Dataproc. For instance, you may use the run.example.sh script which is located under ./bin directory with following parameters:

After the training, you can check the log files and generated models

Replace $BIGDLJAR with bigdl binary name in ./lib in below command, eg: bigdl-SPARK_2.2-0.4.0-jar-with-dependencies.jar

./bin/run.example.sh --model lenet --nodes 2 --cores 2 --memory 1g --batch-size 16 -j lib/$BIGDLJAR -p spark_buildIn

You can also run lenet examples in below command. Before submit below command, please make sure you have already downloaded mnist and put it under mnist directory, more detail see https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/models/lenet:

spark-submit \
 --executor-cores 2 \
 --num-executors 2 \
 --driver-class-path ./lib/$BIGDLJAR \
 --class com.intel.analytics.bigdl.models.lenet.Train \
 ./lib/$BIGDLJAR \
 -f ./mnist \
 -b 16

Run BigDL Python example

Download lenet5.py from https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/models/lenet/lenet5.py

wget https://raw.githubusercontent.com/intel-analytics/BigDL/master/pyspark/bigdl/models/lenet/lenet5.py

Replace $BIGDLJAR with bigdl binary name in ./lib, eg: bigdl-SPARK_2.2-0.4.0-jar-with-dependencies.jar
Replace $BIGDL_PYTHON_ZIP with bigdl python binary name in ./lib, eg: bigdl-0.4.0-python-api.zip

PYTHON_API_ZIP_PATH=./lib/$BIGDL_PYTHON_ZIP
BigDL_JAR_PATH=./lib/$BIGDLJAR
PYTHONPATH=${PYTHON_API_ZIP_PATH}:$PYTHONPATH
spark-submit \
        --driver-cores 2  \
        --driver-memory 2g  \
        --num-executors 2  \
        --executor-cores 2  \
        --executor-memory 4g \
        --py-files ${PYTHON_API_ZIP_PATH},./lenet5.py  \
        --properties-file ./conf/spark-bigdl.conf \
        --jars ${BigDL_JAR_PATH} \
        --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \
        --conf spark.executor.extraClassPath=${BigDL_JAR_PATH} \
        ./lenet5.py \
        --action train