Installing PySpark with Jupyter notebook on Ubuntu 18.04 LTS

Upasana | December 07, 2019

In this tutorial, we will install PySpark and use it from a Jupyter notebook on an Ubuntu machine, then build a Jupyter server and expose it through an nginx reverse proxy over SSL. This way, the Jupyter server will be remotely accessible.

Table of contents

Setup Virtual Environment
Setup Jupyter notebook
Jupyter Server Setup
PySpark setup
Configure bash profile
Setup Jupyter notebook as a service on Ubuntu 18.04 LTS
Nginx Setup
SSL setup using LetsEncrypt

Virtual Environment Setup

Run the commands below in a terminal to install virtualenv, if it is not already on your machine. We will use virtualenv to set up the virtual environment.

Install virtualenv using pip

$ pip install virtualenv

Create virtual environment

$ virtualenv -p python3.6 venv

Here, venv is the name of the virtual environment. The above command creates a virtual environment named venv in the current directory.

To activate the newly created virtual environment, run the command below.

Activate virtual environment

$ source venv/bin/activate
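To confirm from Python that the environment is active, one option is to compare the interpreter prefixes. This is only a minimal sketch: the helper name in_virtualenv is hypothetical, and the sys.real_prefix check is included on the assumption that an older virtualenv release may be in use.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and the
    # stdlib venv module) make sys.base_prefix differ from sys.prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix

print("virtual environment active:", in_virtualenv())
```

If this prints False after activation, the python on your PATH is probably not the one inside venv.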

Install jupyter notebook

To install Jupyter notebook, run the command below. Make sure the virtual environment is activated when you run it.

Install Jupyter Notebook

$ pip install jupyter notebook

Jupyter Server Setup

Now we will set a password for the Jupyter notebook.

Generate the config for Jupyter notebook using the following command:

$ jupyter notebook --generate-config

Update the config:

$ vi /home/<username>/.jupyter/jupyter_notebook_config.py

Set password for the jupyter notebook

## Hashed password to use for web authentication.
#
#  To generate, type in a python/IPython shell:
#
#    from notebook.auth import passwd; passwd()
#
#  The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = u'sha1:020f1412ae63:227357c88b3996e75dcf85ea96c2d581db74ec1e'
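The type:salt:hashed-password form mentioned in the comment can be illustrated with the standard library. The sketch below assumes the scheme used by the classic notebook.auth module (a hex salt appended to the passphrase before hashing). The helper names make_hash and check_hash are hypothetical; in practice you would generate the value with passwd() as described in the config comment.

```python
import hashlib
import hmac
import secrets

def make_hash(passphrase, algorithm="sha1", salt_len=12):
    """Build a 'type:salt:hashed-password' string (illustrative helper)."""
    # Assumption: the salt is a short random hex string, as in the example value.
    salt = secrets.token_hex(salt_len // 2)
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))

def check_hash(stored, passphrase):
    """Return True when passphrase matches a stored 'type:salt:hash' string."""
    try:
        algorithm, salt, digest = stored.split(":", 2)
    except ValueError:
        return False  # not in the expected three-part form
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, digest)
```

Because the salt is random, hashing the same password twice gives two different strings, and both verify correctly.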

Allow remote host access

## Allow requests where the Host header doesn't point to a local server
#
#  By default, requests get a 403 forbidden response if the 'Host' header shows
#  that the browser thinks it's on a non-local domain. Setting this option to
#  True disables this check.
#
#  This protects against 'DNS rebinding' attacks, where a remote web server
#  serves you a page and then changes its DNS to send later requests to a local
#  IP, bypassing same-origin checks.
#
#  Local IP addresses (such as 127.0.0.1 and ::1) are allowed as local, along
#  with hostnames configured in local_hostnames.
c.NotebookApp.allow_remote_access = True
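To make the Host header check described in the comment more concrete, here is a rough sketch of the kind of test it performs. It is not the notebook server's actual code: the function name host_is_local and its default local_hostnames value are assumptions made for illustration.

```python
import ipaddress

def host_is_local(host_header, local_hostnames=("localhost",)):
    """Return True when a Host header value points at a local address."""
    host = host_header.strip()
    if host.startswith("["):
        # Bracketed IPv6 literal, e.g. "[::1]:8888"
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        # "name:port" or "ipv4:port": drop the port
        host = host.rsplit(":", 1)[0]
    try:
        # Loopback IP addresses such as 127.0.0.1 and ::1 count as local.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP address: fall back to the list of local hostnames.
        return host.lower() in local_hostnames
```

A request arriving through the reverse proxy carries the public DNS name in its Host header, so a check like this would reject it unless allow_remote_access is enabled.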

PySpark Setup

We will install PySpark from PyPI. To install it, run the following command from inside the virtual environment:

Install PySpark using PyPI

$ pip install pyspark

For more information, see this web page: https://spark.apache.org/downloads.html

At the time of writing, v2.4.4 is the latest available version of Apache Spark, with Scala version 2.11.12.

Check the installation using the following command

$ spark-shell --version
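The same check can be done from Python inside the virtual environment. This is a small sketch; the helper name installed_pyspark_version is hypothetical, and it returns None instead of failing when PySpark is missing.

```python
def installed_pyspark_version():
    """Return the PySpark version string, or None if PySpark cannot be imported."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__

print("PySpark version:", installed_pyspark_version())
```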

Configure environment using Bash profile

You need to set the following environment variables in the .bashrc file located in your home directory.

~/.bashrc

export SPARK_HOME=/home/<username>/build/jupyter/venv/lib/python3.6/site-packages/pyspark/
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Reload the bash profile changes:

$ source ~/.bashrc
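The SPARK_HOME path depends on where the virtual environment lives, so it may differ from the one shown in this tutorial. One way to look it up is to ask Python where the pyspark package is installed. This is a sketch under the assumption that PySpark was installed with pip into the active environment; guess_spark_home is a hypothetical helper name.

```python
import importlib.util
import os

def guess_spark_home():
    """Return the directory of the installed pyspark package, or None."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("pyspark")
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])

home = guess_spark_home()
print(home if home else "pyspark is not installed in this environment")
```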

Now we can start the Jupyter notebook from command line:

$ pyspark

or using this command:

$ jupyter notebook

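Run PySpark on Jupyter notebook

Open a regular Python 3 notebook on the Jupyter server. We don't need a PySpark kernel because we will use findspark to locate the Spark home. If findspark is not already available in the virtual environment, install it with pip install findspark.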

findspark

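import findspark

findspark.find()  # returns the Spark home that findspark detected
findspark.init()  # makes the pyspark package importable from that location
import pyspark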
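For readers curious about what findspark.init() does, the sketch below shows the general idea as we understand it: set SPARK_HOME and put Spark's Python sources on sys.path. It is not findspark's actual code; the function name add_spark_to_path and the python/lib/py4j-*.zip layout are assumptions based on a typical Spark distribution.

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Expose the Spark Python sources under spark_home to the interpreter."""
    os.environ["SPARK_HOME"] = spark_home
    spark_python = os.path.join(spark_home, "python")
    # Assumption: py4j ships as a zip archive under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    added = [spark_python] + py4j_zips
    for path in added:
        if path not in sys.path:
            sys.path.insert(0, path)
    return added
```

When PySpark is installed with pip into the active virtual environment, import pyspark usually works without this step, so findspark mostly helps when Spark lives outside site-packages.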

Parallelization with PySpark

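import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
  # p (the sample index) is unused; draw a random point in the unit square
  x, y = random.random(), random.random()
  # True when the point falls inside the quarter circle of radius 1
  return x*x + y*y < 1

# Count, in parallel, the samples that land inside the quarter circle
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# The fraction of hits approximates pi/4
pi = 4 * count / num_samples
print(pi)

sc.stop()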
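The estimate works because a random point in the unit square falls inside the quarter circle of radius 1 with probability pi/4. To see the arithmetic without Spark, here is a plain Python sketch of the same Monte Carlo method; the function name estimate_pi and the seed parameter are additions for illustration.

```python
import random

def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # inside the quarter circle
            hits += 1
    # hits / num_samples approximates pi/4
    return 4 * hits / num_samples

print(estimate_pi(100000, seed=42))
```

The Spark version distributes the same loop across workers, which is why it can afford a much larger num_samples.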

Setup Jupyter notebook as a service in Ubuntu 18.04 LTS

We need a systemd service to run Jupyter notebook as a background service.

/etc/systemd/system/jupyter.service

[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/bin/bash -c ". /home/<username>/build/jupyter/venv/bin/activate;jupyter-notebook --notebook-dir=/home/<username>/my-notebooks"
User=<username>
Group=<username>
WorkingDirectory=/home/<username>/my-notebooks
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target

Enable this service

$ sudo systemctl enable jupyter.service

Reload the systemctl daemon

$ sudo systemctl daemon-reload

Start the service

$ sudo systemctl start jupyter.service

Stop the service

$ sudo systemctl stop jupyter.service
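While the service is running, one quick way to confirm that Jupyter is listening is to try a TCP connection to its port. The sketch below assumes the default port 8888, the same one used in the nginx configuration in this tutorial; port_is_open is a hypothetical helper name.

```python
import socket

def port_is_open(host="localhost", port=8888, timeout=2.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False

print("Jupyter reachable on 8888:", port_is_open())
```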

Nginx setup as a reverse proxy

We need to configure HTTP/1.1 and WebSocket support to expose Jupyter notebook through the nginx proxy server.

The following nginx configuration is required to run Jupyter behind the nginx proxy.

/etc/nginx/sites-available/default

server {
    server_name <dns-name>;

    location / {
            proxy_pass http://localhost:8888;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;

            client_max_body_size 10M;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "Upgrade";
            proxy_read_timeout 86400;
    }
}
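The Upgrade and Connection headers in the configuration above matter because the notebook talks to its kernels over WebSocket, and a WebSocket connection starts as an HTTP/1.1 request that asks to be upgraded. The sketch below follows RFC 6455 rather than anything specific to nginx or Jupyter; the function names are hypothetical. It builds the headers a client sends and computes the Sec-WebSocket-Accept value the server must return.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def upgrade_request_headers(host, sec_websocket_key):
    """Headers a client sends to ask for a WebSocket upgrade."""
    return {
        "Host": host,
        "Upgrade": "websocket",
        "Connection": "Upgrade",
        "Sec-WebSocket-Key": sec_websocket_key,
        "Sec-WebSocket-Version": "13",
    }

def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a given client key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

If the proxy drops the Upgrade or Connection header, the backend sees an ordinary HTTP request and the handshake never completes, which typically shows up in the notebook as kernels that fail to connect.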

SSL setup using Free SSL

Let's Encrypt provides free SSL certificates that can be used to secure our site with HTTPS.
