Without pip install
Precondition
First of all, you need to obtain the BigDL libs. Refer to Install from pre built or Install from source code for more details
Remark
- Only Python 2.7, Python 3.5 and Python 3.6 are supported for now.
- Note that Python 3.6 is only compatible with Spark 1.6.4, 2.0.3, 2.1.1 and 2.2.0. See this issue for more discussion.
Set Environment Variables
Set BIGDL_HOME and SPARK_HOME:
- If you download BigDL from the Release Page
export SPARK_HOME=folder path where you extract the spark package
export BIGDL_HOME=folder path where you extract the bigdl package
- If you build BigDL by yourself
export SPARK_HOME=folder path where you extract the spark package
export BIGDL_HOME=the dist folder generated by the build process, which is under the top level of the source folder
Update spark-bigdl.conf (Optional)
If you have some customized properties in some files, which is used with the --properties-file option in spark-submit/pyspark, add these customized properties into ${BIGDL_HOME}/conf/spark-bigdl.conf.
Run with pyspark
${BIGDL_HOME}/bin/pyspark-with-bigdl.sh --master local[*]
--master
set the master URL to connect to--jars
if there are extra jars needed.--py-files
if there are extra python packages needed.
You can also specify other options available for pyspark in the above command if needed.
Example code to verify if BigDL can run successfully
Run with spark-submit
A BigDL Python program runs as a standard pyspark program, which requires all Python dependencies (e.g., NumPy) used by the program to be installed on each node in the Spark cluster. You can try running the BigDL lenet Python example as follows:
${BIGDL_HOME}/bin/spark-submit-with-bigdl.sh --master local[4] lenet5.py
Run with Jupyter
With the full Python API support in BigDL, users can use BigDL together with powerful notebooks (such as Jupyter notebook) in a distributed fashion across the cluster, combining Python libraries, Spark SQL / dataframes and MLlib, deep learning models in BigDL, as well as interactive visualization tools.
Prerequisites: Install all the necessary libraries on the local node where you will run Jupyter, e.g.,
sudo apt install python
sudo apt install python-pip
sudo pip install numpy scipy pandas scikit-learn matplotlib seaborn wordcloud
Launch the Jupyter notebook as follows:
${BIGDL_HOME}/bin/jupyter-with-bigdl.sh --master local[*]
--master
set the master URL to connect to--jars
if there are extra jars needed.--py-files
if there are extra python packages needed.
You can also specify other options available for pyspark in the above command if needed.
After successfully launching Jupyter, you will be able to navigate to the notebook dashboard using your browser. You can find the exact URL in the console output when you started Jupyter; by default, the dashboard URL is http://your_node:8888/
Example code to verify if BigDL can run successfully
Run with virtual environment in Yarn
If you already created BigDL dependency virtual environment according to Yarn cluster guide in install without pip , you can run python program using BigDL as following examples.
- Note: please set BigDL_HOME, SPARK_HOME environment. Set VENV_HOME to the parent directory of venv.zip and venv directory. Replace VERSION with your BigDL version, like 0.5.0. If you don't install BigDL from source, replace ${BigDL_HOME}/pyspark/bigdl/examples/lenet/lenet.py with your python program which is using BigDL.
- Yarn cluster mode
BigDL_HOME=
SPARK_HOME=
PYTHON_API_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-python-api.zip
BigDL_JAR_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar
PYTHONPATH=${PYTHON_API_PATH}:$PYTHONPATH
VENV_HOME=
PYSPARK_PYTHON=./venv.zip/venv/bin/python ${SPARK_HOME}/bin/spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv.zip/venv/bin/python \
--master yarn-cluster \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--properties-file ${BigDL_HOME}/dist/conf/spark-bigdl.conf \
--jars ${BigDL_JAR_PATH} \
--py-files ${PYTHON_API_PATH} \
--archives ${VENV_HOME}/venv.zip \
--conf spark.driver.extraClassPath=bigdl-VERSION-jar-with-dependencies.jar \
--conf spark.executor.extraClassPath=bigdl-VERSION-jar-with-dependencies.jar \
${BigDL_HOME}/pyspark/bigdl/examples/lenet/lenet.py
-
Yarn client mode ``` BigDL_HOME= SPARK_HOME= PYTHON_API_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-python-api.zip BigDL_JAR_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar PYTHONPATH=${PYTHON_API_PATH}:$PYTHONPATH VENV_HOME=
PYSPARK_DRIVER_PYTHON=${VENV_HOME}/venv/bin/python PYSPARK_PYTHON=./venv.zip/venv/bin/python ${SPARK_HOME}/bin/spark-submit \ --master yarn \ --deploy-mode client \ --executor-memory 10g \ --driver-memory 10g \ --executor-cores 16 \ --num-executors 2 \ --properties-file ${BigDL_HOME}/dist/conf/spark-bigdl.conf \ --jars ${BigDL_JAR_PATH} \ --py-files ${PYTHON_API_PATH} \ --archives ${VENV_HOME}/venv.zip \ --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \ --conf spark.executor.extraClassPath=bigdl-VERSION-jar-with-dependencies.jar \ ${BigDL_HOME}/pyspark/bigdl/examples/lenet/lenet.py ```
BigDL Configuration
Please check this page