sparknotebook
IMPORTANT
I am in the process of removing IScala as its development appears stalled. I'm replacing it with jupyter-scala. However, jupyter-scala doesn't build for Scala 2.10 yet, and Spark requires Scala 2.10.
This project contains samples of Jupyter notebooks running Spark. One notebook, 2-Text-Analytics.ipynb, is written in Python. The second, Scala-2-Text-Analytics.ipynb, is in Scala. The dataset and the most excellent 2-Text-Analytics.ipynb are originally from https://github.com/xsankar/cloaked-ironman.
Just open each notebook to see how Spark is instantiated and used.
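To give a flavor of the pattern before you open them, here is a minimal sketch in Python. It assumes the notebook was launched with the pyspark command shown in the Python set-up below, which pre-creates a SparkContext named sc; the file path is illustrative, not a path from this repo.

# sc is the SparkContext that the pyspark launcher creates for you
# (assumes the notebook was started with the command shown below).
# The file path here is illustrative only.
lines = sc.textFile("some-text-file.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.take(10)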
Python Set-up
To run the python notebook, you will need to:
- Install IPython and IPython Notebook. For simplicity, I am just using the free Anaconda Python distribution from Continuum Analytics.
- Download and install the Spark distribution. The download includes the pyspark script that you need to launch Python with Spark.
For best results, cd into this project's root directory before starting IPython. The actual command to start the IPython notebook is:
IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark
NOTE: Sometimes when running Spark on Java 7 you may get a java.net.UnknownHostException. I have not yet seen this on Java 8. If this happens to you, you can resolve it by setting the SPARK_LOCAL_IP environment variable to 127.0.0.1 before launching Spark. For example:
SPARK_LOCAL_IP=127.0.0.1 IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark
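Once the notebook server is up, a quick sanity check is to run a small throwaway job in a cell. This is just a sketch, again assuming the sc provided by the pyspark launcher:

# Distribute a small range of numbers and sum it on the workers;
# if this returns 499500, Spark is wired up correctly.
sc.parallelize(range(1000)).sum()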
Scala Set-up
To run the Scala notebook, you will need to:
- Install jupyter-scala
git clone https://github.com/alexarchambault/jupyter-scala.git
cd jupyter-scala
sbt cli/packArchive
sbt publishM2
# unpack cli/target/jupyter-scala_2.11.6-0.2.0-SNAPSHOT.zip
cd cli/target/jupyter-scala_2.11.6-0.2.0-SNAPSHOT/bin
./jupyter-scala
- Start Jupyter (formerly IPython)
ipython notebook
# When the notebook starts you may need to select the "Scala 2.11" kernel
If your notebook crashes with OutOfMemoryErrors, you can increase the amount of memory available with the -Xmx flag (e.g. -Xmx2g or -Xmx2048m will both allocate 2GB of memory for the JVM to use):
SBT_OPTS=-Xmx2048m ipython notebook --profile "Scala 2.11"
As with the Python example, if you get a java.net.UnknownHostException when starting IPython, use the following command:
SPARK_LOCAL_IP=127.0.0.1 SBT_OPTS=-Xmx2048m ipython notebook --profile "Scala 2.11"
NOTE: For the Scala notebook, you do not need to download and install Spark. The Spark dependencies are managed via sbt, which runs under the hood in the Scala notebook.