# Run on Kubernetes


Spark can run on clusters managed by Kubernetes. This feature makes use of the native Kubernetes scheduler that was added in Spark 2.3. As a deep learning library for Apache Spark, BigDL can now also run on Kubernetes by leveraging Spark on Kubernetes.


## Prerequisites


### Docker image

For Spark 2.3, BigDL ships with a Dockerfile that can be found in the docker/spark2.3-k8s directory.

To build it, copy the docker/spark2.3-k8s folder into the kubernetes/dockerfiles/ directory of your unzipped Spark 2.3 folder.

Then invoke the docker build command from the top-level directory of the Spark distribution, e.g.: `docker build -t bigdl-spark2.3-k8s:latest -f kubernetes/dockerfiles/spark2.3-k8s/Dockerfile .`

You can set your own image name and tag. We'll just use bigdl-spark2.3-k8s:latest as the name and tag for demonstration in the following usage example.
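Putting the copy and build steps together, here is a minimal sketch. The `$SPARK_HOME` and `$BIGDL_HOME` paths below are placeholders for illustration; adjust them to your own layout:

```shell
# placeholders: your unzipped Spark 2.3 distribution and BigDL source
SPARK_HOME=/path/to/spark-2.3
BIGDL_HOME=/path/to/BigDL

# copy the BigDL Dockerfile folder into the Spark distribution
cp -r $BIGDL_HOME/docker/spark2.3-k8s $SPARK_HOME/kubernetes/dockerfiles/

# build from the top-level directory of the Spark distribution
cd $SPARK_HOME
docker build -t bigdl-spark2.3-k8s:latest \
    -f kubernetes/dockerfiles/spark2.3-k8s/Dockerfile .
```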


## Run BigDL examples

Now, let's take a quick tour of how to run the BigDL Lenet5 example on a local k8s cluster created by minikube.

### Build your BigDL on Kubernetes image

Follow the instructions in the Docker image section above, and you will have your pre-built docker image bigdl-spark2.3-k8s:latest.

Now you can tag it and push it to your Docker Hub so that your k8s cluster's docker daemon can pull it. Something like:

```shell
# tag the image for yourself
docker tag bigdl-spark2.3-k8s my-repo-name/bigdl-spark2.3-k8s

# push
docker push my-repo-name/bigdl-spark2.3-k8s
```

Alternatively, if you test with minikube, you can build your docker image using minikube's Docker daemon directly. There is no need to push the image into minikube in that case; it will be automatically available when running applications inside the minikube cluster.
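For example, here is a minimal sketch of building against minikube's daemon using the `minikube docker-env` helper (assuming `$SPARK_HOME` points at your Spark 2.3 distribution as in the earlier sketch):

```shell
# point the local docker CLI at minikube's Docker daemon
eval $(minikube docker-env)

# build as before; the image is created directly inside minikube
cd $SPARK_HOME
docker build -t bigdl-spark2.3-k8s:latest \
    -f kubernetes/dockerfiles/spark2.3-k8s/Dockerfile .
```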

### Download minikube
Follow the instructions from the [minikube chapter](https://kubernetes.io/docs/tasks/tools/install-minikube/)
in the official Kubernetes documentation and install minikube on your client machine.

After installation, run `minikube start`. You should see terminal output like this:
```shell
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
```

To further verify that your minikube's local k8s cluster is correctly launched, run `kubectl cluster-info`. It tells you where the Kubernetes master and DNS are running. Sample output should look like:

```shell
Kubernetes master is running at https://192.168.99.100:8443
KubeDNS is running at https://192.168.99.100:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
```

### Create Kubernetes service account

```shell
# create a service account (you can replace "spark" with your preferred name)
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```

The reason why we need to do this is illustrated here. In short, the default service account may not allow the driver pod to create executor pods, so we need to create another one that is granted this access.
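To double-check that the account and binding exist before submitting, you can run the following (the exact output format may vary with your kubectl version):

```shell
# confirm the service account was created
kubectl get serviceaccount spark

# confirm the role binding grants it the edit role
kubectl describe clusterrolebinding spark-role
```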

### Run the Lenet5 application

It's been a lot of setup work that might have made your palms sweaty, but finally we are ready to launch the application.

Run the script below to train the Lenet5 model for only 2 epochs on the MNIST dataset as a demo.

```shell
SPARK_HOME=... # your spark 2.3 home directory
# the master ip can be found by kubectl cluster-info;
# replace "spark" with your own service account name if needed
$SPARK_HOME/bin/spark-submit \
    --master k8s://your-master-ip \
    --deploy-mode cluster \
    --name bigdl-lenet5 \
    --class com.intel.analytics.bigdl.models.lenet.Train \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.executor.instances=4 \
    --conf spark.executor.cores=1 \
    --conf spark.cores.max=4 \
    --conf spark.kubernetes.container.image=docker.io/my-repo-name/bigdl-spark2.3-k8s:latest \
    local:///opt/bigdl-0.6.0/lib/bigdl-SPARK_2.3-0.6.0-jar-with-dependencies.jar \
    -f hdfs://path-to-your-mnist \
    -b 128 \
    -e 2 \
    --checkpoint /tmp
```

In the above command:
* `-f`: where you put your MNIST data.
* `--checkpoint`: where you cache the model/train-state snapshots. You should pass a folder and make sure the folder exists when you run this example. The model snapshot will be named model.#iteration_number, and the train state will be named state.#iteration_number. Note that if some files already exist in the folder, the old files will not be overwritten, for the safety of your model files.
* `-b`: the mini-batch size. It is expected that the mini-batch size is a multiple of node_number * core_number.
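While the application runs, you can watch its pods and follow the training log from another terminal. A small sketch (the driver pod name is generated by Spark, so look it up with `kubectl get pods` first):

```shell
# list the driver and executor pods created by the application
kubectl get pods

# follow the driver log to see the training progress
# (replace <driver-pod-name> with the actual name from kubectl get pods)
kubectl logs -f <driver-pod-name>
```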

You can find more information about this example in the BigDL lenet folder.


## Future Work

Several Spark on Kubernetes features are still waiting to be added in future versions of Spark; see more here. BigDL will support some of those, such as PySpark and local file dependency management, by leveraging Spark, and will provide the corresponding documentation, docker image, and scripts to run the examples, so please stay tuned. You are welcome to send your questions to our BigDL user group.