Set Environment Variables
To achieve high performance, BigDL uses Intel MKL and multi-threaded programming; therefore, you need to first set the environment variables by sourcing the provided script, as follows:
$ source PATH_To_BigDL/scripts/bigdl.sh
Use Interactive Spark Shell
First, set the environment variables as described in Set Environment Variables.
Then you can easily try BigDL in the interactive Spark shell. Run the command below to start the Spark shell with BigDL support:
$ SPARK_HOME/bin/spark-shell --properties-file dist/conf/spark-bigdl.conf \
--jars bigdl-VERSION-jar-with-dependencies.jar
You will see a welcome message like the one below:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Spark context available as sc.

scala>
To use BigDL, you should first initialize the engine as shown below:
scala> import com.intel.analytics.bigdl.utils.Engine
scala> Engine.init
Once the engine is successfully initialized, you'll be able to play with the BigDL APIs. For instance, to experiment with the Tensor APIs in BigDL, you may try the code below:
scala> import com.intel.analytics.bigdl.tensor.Tensor
import com.intel.analytics.bigdl.tensor.Tensor
scala> Tensor[Double](2,2).fill(1.0)
res9: com.intel.analytics.bigdl.tensor.Tensor[Double] =
1.0 1.0
1.0 1.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 2x2]
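Beyond creating and filling tensors, you can combine them with Torch-style operations. Below is a brief sketch of two more Tensor calls (element-wise addition and transpose); the output is elided here, and the BigDL Tensor API docs list the full set of operations:
scala> val a = Tensor[Double](2, 2).fill(1.0)  // 2x2 tensor of ones
scala> val b = Tensor[Double](2, 2).fill(2.0)  // 2x2 tensor of twos
scala> a + b    // element-wise addition, returns a new 2x2 tensor of 3.0s
scala> a.t()    // the transpose of a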
Run as a Spark Program
First, set the environment variables as described in Set Environment Variables.
Then you can run a BigDL program, e.g., the VGG training, as a standard Spark program (running in either local mode or cluster mode) as follows:
1. Download the CIFAR-10 dataset from https://www.cs.toronto.edu/~kriz/cifar.html; remember to choose the binary version.
2. Run the VGG training:
# Spark local mode
spark-submit --master local[core_number] --class com.intel.analytics.bigdl.models.vgg.Train \
${BIGDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar \
-f path_to_your_cifar_folder \
-b batch_size
# Spark standalone mode
spark-submit --master spark://... --executor-cores cores_per_executor \
--total-executor-cores total_cores_for_the_job \
--class com.intel.analytics.bigdl.models.vgg.Train \
${BIGDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar \
-f path_to_your_cifar_folder \
-b batch_size
# Spark yarn mode
spark-submit --master yarn --deploy-mode client \
--executor-cores cores_per_executor \
--num-executors executors_number \
--class com.intel.analytics.bigdl.models.vgg.Train \
${BIGDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar \
-f path_to_your_cifar_folder \
-b batch_size
The parameters used in the above commands are:
- -f: the folder where you put the CIFAR-10 data set. Note that in this example this is just a local folder on the Spark driver; since the CIFAR-10 data is fairly small (about 120MB), it is sent directly from the driver to the executors.
- -b: the mini-batch size, which is expected to be a multiple of the total cores used in the job. In this example, the suggested mini-batch size is total cores * 4 (see the short example after this list).
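As a worked example of that sizing rule (all numbers here are illustrative, not required values):
// Illustrative arithmetic only, not a BigDL API:
val executorNum = 4                           // e.g., --num-executors 4
val executorCores = 4                         // e.g., --executor-cores 4
val totalCores = executorNum * executorCores  // 16 cores in total
val batchSize = totalCores * 4                // 64, a multiple of the total cores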
If you run your own program, remember to create the SparkContext and initialize the engine before calling other BigDL APIs, as shown below.
// Scala code example
import org.apache.spark.SparkContext
import com.intel.analytics.bigdl.utils.Engine

val conf = Engine.createSparkConf()
val sc = new SparkContext(conf)
Engine.init
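For reference, below is a minimal, self-contained sketch of such a program; the object name MyBigDLApp, the app name, and the trivial tensor at the end are hypothetical placeholders, while the createSparkConf / SparkContext / Engine.init sequence is the pattern shown above:
import org.apache.spark.SparkContext
import com.intel.analytics.bigdl.utils.Engine
import com.intel.analytics.bigdl.tensor.Tensor

object MyBigDLApp {
  def main(args: Array[String]): Unit = {
    // Engine.createSparkConf() adds the BigDL-specific properties to the Spark conf
    val conf = Engine.createSparkConf().setAppName("MyBigDLApp")
    val sc = new SparkContext(conf)
    Engine.init // must run before any other BigDL call

    // Your BigDL code goes here; a trivial placeholder:
    println(Tensor[Float](2, 2).fill(1.0f))

    sc.stop()
  }
}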