Set up Apache Spark and initialize pyspark with ease

bigdata Nov 9, 2016

Recently I started working on Apache Spark and was having a hard time configuring it on my machine. Setting up the environment turned into pure trial and error. Then I came across this gist, which explains how to set up Apache Spark with IPython Notebook on macOS. I love working in IPython notebooks, but most of the time I just need a plain ipython shell, so I skipped the IPython Notebook portion of the setup.

After a bit of tweaking I got the setup I wanted. I came up with the bash alias below (drop it into your ~/.bashrc or ~/.bash_profile to make it stick) to run pyspark inside ipython:

alias ipyspark='PYSPARK_DRIVER_PYTHON=ipython /usr/local/bin/pyspark --master local[*] --driver-memory 2g'

You can read more about these flags in the documentation here. In short, --master local[*] runs Spark locally with one worker thread per logical core, and --driver-memory 2g caps the driver JVM at 2 GB; tune both to fit your machine.
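As a quick sanity check: the shell started by this alias already exposes a live SparkContext as sc, so you can confirm the flags took effect right away. A minimal sketch using the standard sc.master and sc.defaultParallelism attributes; the exact output depends on your machine and Python version (shown here for a Python 2 shell on an 8-core box):

>>> sc.master                 # reflects the --master flag from the alias
u'local[*]'
>>> sc.defaultParallelism     # with local[*], one slot per logical core
8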

Now comes the best part. Once you are inside the ipython shell you want to initialize pyspark as quickly as possible. For that I found findspark. Just install it with pip install findspark and start using pyspark with ease:

>>> import findspark
>>> findspark.init()  # locate the Spark installation and add pyspark to sys.path
>>> import pyspark
>>> sc = pyspark.SparkContext(appName="myAppName")
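With the context up, a tiny smoke test never hurts. sc.parallelize and sum are standard SparkContext/RDD methods, and sc.stop() matters because only one SparkContext can be active at a time:

>>> sc.parallelize(range(100)).sum()  # distribute 0..99 across local workers and add them up
4950
>>> sc.stop()  # shut the context down before creating a new one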

Happy hacking!

Related reads:

Getting started with Spark? Check out this repo: http://jadianes.github.io/spark-py-notebooks

MOOC on edx.org: https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x
