Run on Kubernetes
Spark can run on clusters managed by Kubernetes. This feature makes use of native Kubernetes scheduler that has been added to Spark 2.3. Now as a deep learning library for Apache Spark, BigDL can also run on Kubernetes by leveraging Spark on Kubernetes.
Download the spark2.3 release from Spark and unzip it.
BigDL ships with a Dockerfile that can be found in the
To built it, copy the
docker/spark2.3-k8s folder under
kubernetes/dockerfiles/ your unzipped spark 2.3 folder.
Then the docker build
command should be invoked from the top level directory of the Spark distribution. E.g.:
docker build -t bigdl-spark2.3-k8s:latest -f kubernetes/dockerfiles/spark2.3-k8s/Dockerfile .
You can set your own image name and tag. We'll just use
bigdl-spark2.3-k8s:latest as the name and tag for
demonstration in the following usage example.
Run BigDL examples
Now, let's go on a quick tour on how to run BigDL Lenet5 example with a local k8s cluster created by minikube.
Build your BigDL on Kubernetes image
Follow the instructions under Docker Image section above, and you can have your pre-built docker image
Now you can tag it and push it to your docker hub for your k8s cluster's docker deamon to pull it. Something like: ```$shell
tag the image for yourself
docker tag bigdl-spark2.3-k8s my-repo-name/bigdl-spark2.3-k8s
docker push my-repo-name/bigdl-spark2.3-k8s
Or If we test with minikube, you can build your docker image using minikube since it will do so directly into minikube's Docker daemon. There is no need to push the images into minikube in that case, they'll be automatically available when running applications inside the minikube cluster. ### Download minikube Follow the instructions from the [minikube chapter](https://kubernetes.io/docs/tasks/tools/install-minikube/) in the Kubernetes official document and install minikube on your client machine. After installation, run ```minikube start```. You can see the terminal output like this: ```shell Starting local Kubernetes v1.10.0 cluster... Starting VM... Getting VM IP address... Moving files into cluster... Setting up certs... Connecting to cluster... Setting up kubeconfig... Starting cluster components... Kubectl is now configured to use the cluster. Loading cached images from config file.
To further verify your minikube's local k8s cluster is correctly launched, run
It tells you where the kubernetes master DNS is running at. Sample output should be like:
Kubernetes master is running at https://192.168.99.100:8443 KubeDNS is running at https://192.168.99.100:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
create Kubernetes service account
kubectl create spark(you can replace it with your preferred name) kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
The reason why we need to do this is illustrated here.
In simple words, the
default service account may not allow driver pods to create pods so
we need to create another one granted with access.
Run the Letnet5 application
It's been a lot of setup work that might make your palm sweaty but finally we are ready to launch the application.
Run the script below to train lenet5 model for only 2 epochs on MNIST dataset as a demo.
SPARK_HOME=...(your spark 2.3 home directory) $SPARK_HOME/bin/spark-submit \ --master k8s://your-master-ip(found by kubertcl cluster-info)\ --deploy-mode cluster \ --name bigdl-lenet5 \ --class com.intel.analytics.bigdl.models.lenet.Train \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark(Or your own service account name) \ --conf spark.executor.instances=4 \ --conf spark.executor.cores=1 \ --conf spark.cores.max=4 \ --conf spark.kubernetes.container.image=docker.io/my-repo-name/bigdl-spark2.3-k8s:latest \ local:///opt/bigdl-0.6.0/lib/bigdl-SPARK_2.3-0.6.0-jar-with-dependencies.jar \ -f hdfs://path-to-your-mnist \ -b 128 \ -e 2 \ --checkpoint /tmp
In the above commands -f: where you put your MNIST data --checkpoint: Where you cache the model/train_state snapshot. You should input a folder and make sure the folder is created when you run this example. The model snapshot will be named as model.#iteration_number, and train state will be named as state.#iteration_number. Note that if there are some files already exist in the folder, the old file will not be overwrite for the safety of your model files. * -b: The mini-batch size. It is expected that the mini-batch size is a multiple of node_number * core_number.
You can find more information about this example in BigDL lenet folder.
As there are several Spark on Kubernetes features that await to be added into the future versions of Spark. See more here. Some of those such as Pyspark and Local File Dependency Management BigDL will also support by leveraging Spark, and the document, docker image plus script to run the examples so please stay tuned. We welcome you send your questions in our BigDL user group during the usage.