Without pip install


Precondition

First of all, you need to obtain the BigDL libs. Refer to Install from pre built or Install from source code for more details.

Remark

Set Environment Variables

Set BIGDL_HOME and SPARK_HOME:

If you downloaded a pre-built BigDL package:

export SPARK_HOME=folder path where you extract the spark package
export BIGDL_HOME=folder path where you extract the bigdl package

If you built BigDL from source:

export SPARK_HOME=folder path where you extract the spark package
export BIGDL_HOME=the dist folder generated by the build process, which is under the top level of the source folder
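
Before launching any of the scripts in this guide, it can help to confirm that both variables point at real folders. The following is a minimal sketch in Python, not part of BigDL: the helper name `check_homes` is hypothetical, and the check for `bin/pyspark-with-bigdl.sh` only reflects the script path used later on this page.

```python
import os


def check_homes(env=None):
    """Return a list of problems found with SPARK_HOME and BIGDL_HOME.

    Hypothetical helper for illustration only; it is not a BigDL API.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("SPARK_HOME", "BIGDL_HOME"):
        path = env.get(name)
        if not path:
            problems.append(name + " is not set")
        elif not os.path.isdir(path):
            problems.append(name + " is not a directory: " + path)
    # This guide calls launcher scripts under ${BIGDL_HOME}/bin.
    bigdl_home = env.get("BIGDL_HOME")
    if bigdl_home and os.path.isdir(bigdl_home):
        script = os.path.join(bigdl_home, "bin", "pyspark-with-bigdl.sh")
        if not os.path.isfile(script):
            problems.append("missing launcher script: " + script)
    return problems
```

Calling `check_homes()` with no argument inspects the current process environment; an empty list means nothing suspicious was found.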

Update spark-bigdl.conf (Optional)

If you keep customized properties in a file that you pass to spark-submit or pyspark with the --properties-file option, add those properties to ${BIGDL_HOME}/conf/spark-bigdl.conf.
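
One way to do this merge without editing the file by hand is sketched below. It is an illustration under assumptions, not a BigDL tool: `merge_properties` is a hypothetical helper, it assumes one property per line with the key separated from the value by whitespace or `=`, and it leaves keys that already exist in spark-bigdl.conf untouched.

```python
def merge_properties(custom_text, bigdl_conf_text):
    """Append custom Spark properties to the text of spark-bigdl.conf.

    Hypothetical helper for illustration. Keys that are already present in
    spark-bigdl.conf are kept as they are; only new keys are appended.
    """
    def key_of(line):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            return None  # blank line or comment
        # Treat the first '=' like whitespace, then take the first token.
        return stripped.replace("=", " ", 1).split(None, 1)[0]

    existing = {key_of(line) for line in bigdl_conf_text.splitlines()}
    merged = bigdl_conf_text.rstrip("\n").splitlines()
    for line in custom_text.splitlines():
        key = key_of(line)
        if key is not None and key not in existing:
            merged.append(line.strip())
            existing.add(key)
    return "\n".join(merged) + "\n"
```

Reading both files, passing their contents to `merge_properties`, and writing the result back to ${BIGDL_HOME}/conf/spark-bigdl.conf would complete the step; keeping a backup of the original file is advisable. Skipping keys that BigDL already sets is a conservative choice made for this sketch.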

Run with pyspark

${BIGDL_HOME}/bin/pyspark-with-bigdl.sh --master local[*]

You can also specify other options available for pyspark in the above command if needed.

Example code to verify if BigDL can run successfully
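
A minimal sketch of such a check is shown here. It assumes the BigDL 0.x Python module layout (`bigdl.util.common` and `bigdl.nn.layer`); the wrapper function `verify_bigdl` is a hypothetical name used for illustration. It is meant to be pasted into the pyspark shell started above, where a SparkContext already exists.

```python
def verify_bigdl():
    """Create a tiny layer to confirm that BigDL is usable from Python.

    Assumes the BigDL 0.x module layout. The imports are kept inside the
    function so that a missing package fails at call time with a clear error.
    """
    from pyspark import SparkContext
    from bigdl.util.common import create_spark_conf, init_engine
    from bigdl.nn.layer import Linear

    # Reuse the running SparkContext if there is one (as in the pyspark shell).
    SparkContext.getOrCreate(conf=create_spark_conf())
    init_engine()  # prepare the BigDL execution environment
    return Linear(2, 3)  # a fully connected layer with 2 inputs and 3 outputs
```

If `verify_bigdl()` returns a layer object without raising an exception, the Python API and the BigDL jar are both visible to the session.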

Run with spark-submit

A BigDL Python program runs as a standard pyspark program, which requires all Python dependencies (e.g., NumPy) used by the program to be installed on each node in the Spark cluster. You can try running the BigDL lenet Python example as follows:

${BIGDL_HOME}/bin/spark-submit-with-bigdl.sh --master local[4] lenet5.py
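
Because every node needs the same Python dependencies, a quick import check on each node can save a failed job. The sketch below uses only the Python 3 standard library; `missing_modules` is a hypothetical helper, and the module list you pass is whatever your own program imports.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, running `missing_modules(["numpy"])` on a node returns an empty list when NumPy is installed there, and `["numpy"]` otherwise.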

Run with Jupyter

With the full Python API support in BigDL, users can use BigDL together with powerful notebooks (such as Jupyter notebook) in a distributed fashion across the cluster, combining Python libraries, Spark SQL / dataframes and MLlib, deep learning models in BigDL, as well as interactive visualization tools.

Prerequisites: Install all the necessary libraries on the local node where you will run Jupyter, e.g.,

sudo apt install python
sudo apt install python-pip
sudo pip install numpy scipy pandas scikit-learn matplotlib seaborn wordcloud

Launch the Jupyter notebook as follows:

${BIGDL_HOME}/bin/jupyter-with-bigdl.sh --master local[*]

You can also specify other options available for pyspark in the above command if needed.

After successfully launching Jupyter, you will be able to navigate to the notebook dashboard using your browser. You can find the exact URL in the console output when you started Jupyter; by default, the dashboard URL is http://your_node:8888/

Example code to verify if BigDL can run successfully

Run with virtual environment in Yarn

If you have already created the BigDL dependency virtual environment by following the Yarn cluster guide in Install without pip, you can run a Python program that uses BigDL as in the following example.

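    BigDL_HOME=    # top-level BigDL source folder (the one that contains dist/)
    SPARK_HOME=    # folder where you extracted the Spark package
    # Replace VERSION below with the version of your BigDL build.
    PYTHON_API_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-python-api.zip
    BigDL_JAR_PATH=${BigDL_HOME}/dist/lib/bigdl-VERSION-jar-with-dependencies.jar
    PYTHONPATH=${PYTHON_API_PATH}:$PYTHONPATH
    VENV_HOME=     # folder that contains venv.zip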

    PYSPARK_PYTHON=./venv.zip/venv/bin/python ${SPARK_HOME}/bin/spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv.zip/venv/bin/python \
    --master yarn-cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --properties-file ${BigDL_HOME}/dist/conf/spark-bigdl.conf \
    --jars ${BigDL_JAR_PATH} \
    --py-files ${PYTHON_API_PATH} \
    --archives ${VENV_HOME}/venv.zip \
    --conf spark.driver.extraClassPath=bigdl-VERSION-jar-with-dependencies.jar \
    --conf spark.executor.extraClassPath=bigdl-VERSION-jar-with-dependencies.jar \
    ${BigDL_HOME}/pyspark/bigdl/examples/lenet/lenet.py
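
The VERSION placeholder appears in several places in the command above and has to be the same everywhere. The sketch below derives all of the names from one version string. It is a hypothetical helper that only mirrors the naming pattern shown in this section; it does not check that the files exist.

```python
import posixpath


def bigdl_artifacts(bigdl_home, version):
    """Derive the artifact paths used by the spark-submit command above.

    Hypothetical helper: it follows the bigdl-VERSION-... naming pattern
    from this section and assumes artifacts live under <bigdl_home>/dist/lib.
    """
    lib = posixpath.join(bigdl_home, "dist", "lib")
    jar_name = "bigdl-{}-jar-with-dependencies.jar".format(version)
    return {
        "python_api": posixpath.join(lib, "bigdl-{}-python-api.zip".format(version)),
        "jar": posixpath.join(lib, jar_name),
        # The extraClassPath settings use the bare file name, as in the command.
        "extra_class_path": jar_name,
    }
```

The returned values correspond to PYTHON_API_PATH, BigDL_JAR_PATH, and the value given to spark.driver.extraClassPath and spark.executor.extraClassPath.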

BigDL Configuration

Please check this page