Installing PySpark with Jupyter notebook on Ubuntu 18.04 LTS

Upasana | December 07, 2019


In this tutorial we will learn how to install and work with PySpark in a Jupyter notebook on an Ubuntu machine, and how to build a Jupyter server by exposing it through an nginx reverse proxy over SSL. This way, the Jupyter server will be remotely accessible.
Table of contents
  1. Setup Virtual Environment
  2. Setup Jupyter notebook
  3. Jupyter Server Setup
  4. PySpark setup
  5. Configure bash profile
  6. Setup Jupyter notebook as a service on Ubuntu 18.04 LTS
  7. Nginx Setup
  8. SSL setup using LetsEncrypt

Virtual Environment Setup

Run the command below in the terminal to install virtualenv on your machine, if it is not there already. We will use virtualenv to set up the virtual environment.

Install virtualenv using pip
$ pip install virtualenv
Create virtual environment
$ virtualenv -p python3.6 venv

where venv is the name of the virtual environment. The above command creates a virtual environment named venv in the current directory.

To activate this newly created virtual environment, run the command below.

Activate virtual environment
$ source venv/bin/activate
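
To confirm that the activation worked, you can ask the interpreter itself. The snippet below is a minimal sketch, not part of the original setup; the helper name `in_virtualenv` is made up for illustration. It assumes older virtualenv releases set `sys.real_prefix`, while newer ones make `sys.base_prefix` differ from `sys.prefix`.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with the `python` from the activated environment; the printed prefix should point inside the venv directory.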

Install jupyter notebook

To install Jupyter notebook, run the command below. Make sure the virtual environment is activated when you run it.

Install Jupyter Notebook
$ pip install jupyter notebook

Jupyter Server Setup

Now we will set up the password for Jupyter notebook.

Generate the config file for Jupyter notebook using the following command:

$ jupyter notebook --generate-config

Update the config:

$ vi /home/<username>/.jupyter/jupyter_notebook_config.py
Set password for the jupyter notebook
## Hashed password to use for web authentication.
#
#  To generate, type in a python/IPython shell:
#
#    from notebook.auth import passwd; passwd()
#
#  The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = u'sha1:020f1412ae63:227357c88b3996e75dcf85ea96c2d581db74ec1e'
Allow remote host access
## Allow requests where the Host header doesn't point to a local server
#
#  By default, requests get a 403 forbidden response if the 'Host' header shows
#  that the browser thinks it's on a non-local domain. Setting this option to
#  True disables this check.
#
#  This protects against 'DNS rebinding' attacks, where a remote web server
#  serves you a page and then changes its DNS to send later requests to a local
#  IP, bypassing same-origin checks.
#
#  Local IP addresses (such as 127.0.0.1 and ::1) are allowed as local, along
#  with hostnames configured in local_hostnames.
c.NotebookApp.allow_remote_access = True
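
To make the `type:salt:hashed-password` format more concrete, here is a small standard-library sketch of how such a string could be built and checked. It assumes the legacy sha1 scheme hashes the password bytes followed by the salt; the function names `make_hash` and `check_hash` are hypothetical and not part of Jupyter. For a real password, use `passwd()` from `notebook.auth` as the config comment describes.

```python
import hashlib
import hmac
import secrets


def make_hash(password, salt=None, algorithm="sha1"):
    """Build a 'type:salt:hashed-password' string (illustrative sketch)."""
    if salt is None:
        # 12 hex characters, the same length as the salt in the example hash
        salt = secrets.token_hex(6)
    digest = hashlib.new(algorithm)
    # Assumption: the password bytes are hashed first, followed by the salt.
    digest.update(password.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, digest.hexdigest()))


def check_hash(stored, password):
    """Recompute the hash with the stored salt and compare it to the stored string."""
    try:
        algorithm, salt, _expected = stored.split(":", 2)
    except ValueError:
        # The stored value does not have three colon-separated parts
        return False
    return hmac.compare_digest(make_hash(password, salt, algorithm), stored)


if __name__ == "__main__":
    stored = make_hash("my-secret-password")
    print(stored)
    print(check_hash(stored, "my-secret-password"))
```

Only the salted hash is stored in the config file, so the plain-text password never has to appear in `jupyter_notebook_config.py`.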

PySpark Setup

We will install PySpark from PyPI. To install it, run the following command from inside the virtual environment:

Install PySpark using PyPI
$ pip install pyspark

For more information, see this web page: https://spark.apache.org/downloads.html

At the time of writing, v2.4.4 is the latest available version of Apache Spark, built with Scala version 2.11.12.

Check the installation using the following command:

$ spark-shell --version

Configure environment using Bash profile

You need to set the following environment variables in the .bashrc file located in your home directory.

~/.bashrc
export SPARK_HOME=/home/<username>/build/jupyter/venv/lib/python3.6/site-packages/pyspark/
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Reload the bash profile changes:
$ source ~/.bashrc
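
The SPARK_HOME path above depends on where your virtual environment lives. As a rough sketch, the directory of the pip-installed pyspark package can be looked up from Python instead of being typed by hand. The helper name `package_dir` is hypothetical, and the snippet assumes it is run with the virtual environment's interpreter.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an importable package, or None if it is missing."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package directory
    return os.path.dirname(os.path.abspath(spec.origin))


if __name__ == "__main__":
    spark_home = package_dir("pyspark")
    if spark_home is None:
        print("pyspark is not installed in this environment")
    else:
        # Print a line that can be pasted into ~/.bashrc
        print("export SPARK_HOME=" + spark_home)
```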

Now we can start the Jupyter notebook from the command line:

$ pyspark

or using this command:

$ jupyter notebook

Run Pyspark on jupyter notebook

Open a regular Python 3 notebook on the Jupyter server. We don’t need a PySpark kernel because we will use findspark to locate the Spark home.

findspark
import findspark

findspark.find()
findspark.init()
import pyspark
Parallelization with Pyspark
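import random

# Start a Spark context for this application
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and test whether it falls inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Distribute the samples across the workers and count the hits
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# The fraction of hits approximates pi/4
pi = 4 * count / num_samples
print(pi)

sc.stop()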
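
For comparison, the same Monte Carlo estimate can be sketched in plain Python without Spark. This is only an illustration of what the `filter(inside).count()` step computes; the function name `estimate_pi` and the `seed` parameter are made up for this example, and the loop runs on a single core rather than in parallel.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # A point is a hit when it lies inside the quarter circle of radius 1
        if x * x + y * y < 1:
            count += 1
    # The quarter circle covers pi/4 of the unit square
    return 4 * count / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000, seed=0))
```

Spark splits the same counting work across partitions, which is why the notebook version can handle a much larger number of samples.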

Setup Jupyter notebook as a service in Ubuntu 18.04 LTS

We need a systemd service so that Jupyter notebook can run as a background service.

/etc/systemd/system/jupyter.service
[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/bin/bash -c ". /home/<username>/build/jupyter/venv/bin/activate;jupyter-notebook --notebook-dir=/home/<username>/my-notebooks"
User=<username>
Group=<username>
WorkingDirectory=/home/<username>/my-notebooks
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable this service
$ sudo systemctl enable jupyter.service
Reload the systemctl daemon
$ sudo systemctl daemon-reload
Start the service
$ sudo systemctl start jupyter.service
Stop the service
$ sudo systemctl stop jupyter.service
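
The `<username>` placeholders in the unit file have to be replaced before the service can start. The sketch below fills them in from a template that mirrors the unit file shown above. The function name `render_unit` and its default paths are assumptions based on the paths used in this article; adjust them to your own layout.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/bin/bash -c ". {venv}/bin/activate;jupyter-notebook --notebook-dir={notebooks}"
User={user}
Group={user}
WorkingDirectory={notebooks}
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
"""


def render_unit(user, venv=None, notebooks=None):
    """Return the unit file text with the placeholders filled in."""
    home = "/home/" + user
    # Defaults follow the example paths used in this article (assumed layout)
    venv = venv or home + "/build/jupyter/venv"
    notebooks = notebooks or home + "/my-notebooks"
    return UNIT_TEMPLATE.format(user=user, venv=venv, notebooks=notebooks)


if __name__ == "__main__":
    # Print the unit text; redirect it to a file and copy it into place with sudo
    print(render_unit("alice"))
```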

Nginx setup as a reverse proxy

We need to configure HTTP/1.1 and WebSocket support in order to expose Jupyter notebook through the nginx proxy server.

The following nginx configuration is required to run Jupyter behind the nginx proxy.

/etc/nginx/sites-available/default
server {
    server_name <dns-name>;

    location / {
            proxy_pass http://localhost:8888;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;

            client_max_body_size 10M;
            proxy_http_version 1.1;
            proxy_set_header Upgrade "websocket";
            proxy_set_header Connection "Upgrade";
            proxy_read_timeout 86400;
    }
}

SSL setup using Free SSL

Let's Encrypt provides free SSL certificates that can be used to secure our site with HTTPS.

