• This repository was archived on 19 Aug 2020
• Stars: 146
• Rank: 251,289 (top 5%)
• Language: Python
• License: Other
• Created: over 9 years ago
• Updated: over 8 years ago

Repository Details

ipython-spark-docker

This repo provides Docker containers to run:

  • Spark master and worker(s) for running Spark in standalone mode on dedicated hosts
  • Mesos-enhanced containers for Mesos-mastered Spark jobs
  • IPython web interface for interacting with Spark or Mesos master via PySpark

Please see the accompanying blog posts for the technical details and motivation behind this project.

Architecture

Docker containers provide a portable and repeatable method for deploying the cluster:

[Figure: hadoop-docker-client connections]

CDH5 Tools and Libraries

HDFS, HBase, Hive, Oozie, Pig, Hue

Python Packages and Modules

Pattern, NLTK, Pandas, NumPy, SciPy, SymPy, Seaborn, Cython, Numba, Biopython, Rmagic, 0MQ, Matplotlib, Scikit-Learn, Statsmodels, Beautiful Soup, NetworkX, LLVM, Bokeh, Vincent, MDP
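
As a quick sanity check of the client container's toolchain, a snippet like the following can be run in its IPython session. The module names are assumptions inferred from the package list above (e.g. sklearn for Scikit-Learn), not a list maintained by this repo.

    # Hedged sanity check: try importing a sample of the bundled packages.
    # Module names are assumptions inferred from the package list above.
    import importlib

    for module in ["nltk", "pandas", "numpy", "scipy", "sympy", "seaborn",
                   "matplotlib", "sklearn", "networkx", "statsmodels", "bokeh"]:
        try:
            importlib.import_module(module)
            print("%s ok" % module)
        except ImportError as exc:
            print("%s missing: %s" % (module, exc))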

Usage

Option 1. Mesos-mastered Spark Jobs

  1. Install Mesos with the Docker containerizer and Docker images: set up a Mesos cluster configured to use the Docker containerizer, which enables the Mesos slaves to execute Spark tasks within Docker containers.

    A. End-to-end Installation: The script mesos/1-setup-mesos-cluster.sh uses the Python library Fabric to install and configure a cluster according to How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04. After installation, it also pulls the Docker images that will execute Spark tasks. To use:

    • Update the IP addresses of the Mesos nodes in mesos/fabfile.py. Find instances to change with:
    grep 'ip-address' mesos/fabfile.py
    • Install/configure the cluster:
    ./mesos/1-setup-mesos-cluster.sh

    Optional: run ./1-build.sh if you prefer to build the Docker images from scratch rather than having the script pull them from Docker Hub.
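
    For orientation, here is a minimal sketch of the Fabric 1.x pattern a setup script like mesos/1-setup-mesos-cluster.sh builds on. The role definitions and task bodies below are hypothetical illustrations, not the actual contents of mesos/fabfile.py.

      # Minimal Fabric 1.x sketch (hypothetical roles and tasks, not the repo's fabfile)
      from fabric.api import env, roles, sudo, execute

      # Hosts you would edit, analogous to the 'ip-address' entries in mesos/fabfile.py
      env.roledefs = {
          "masters": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
          "slaves":  ["10.0.0.4", "10.0.0.5"],
      }

      @roles("masters")
      def install_master():
          sudo("apt-get -y install mesosphere")  # masters get mesos, marathon, and zookeeper

      @roles("slaves")
      def install_slave():
          sudo("apt-get -y install mesos")       # slaves need mesos only
          sudo("docker pull lab41/spark-mesos-mesosworker-ipython")

      if __name__ == "__main__":
          execute(install_master)
          execute(install_slave)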

    B. Manual Installation: Follow the general steps in mesos/1-setup-mesos-cluster.sh to manually install:

    • Install mesosphere on masters
    • Install mesos on slaves
    • Configure zookeeper on all nodes
    • Configure and start masters
    • Configure and start slaves
    • Load Docker images:
      docker pull lab41/spark-mesos-dockerworker-ipython
      docker pull lab41/spark-mesos-mesosworker-ipython

  2. Run the client container on a client host (replace 'username-for-sparkjobs' and 'mesos-master-fqdn' below):

    ./5-run-spark-mesos-dockerworker-ipython.sh username-for-sparkjobs mesos://mesos-master-fqdn:5050

    Note: the client container creates username-for-sparkjobs when started, which lets you submit Spark jobs as a specific user and/or deploy different IPython servers for different users.
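
    Once the IPython web interface is up, a short PySpark job confirms that work actually dispatches through the Mesos master. This is a minimal smoke test assuming a stock PySpark environment inside the client container; the master URI mirrors the one passed to the run script above.

      # Minimal PySpark smoke test against the Mesos master (assumes stock PySpark)
      from pyspark import SparkConf, SparkContext

      conf = (SparkConf()
              .setMaster("mesos://mesos-master-fqdn:5050")  # same URI as the run script
              .setAppName("mesos-smoke-test"))
      sc = SparkContext(conf=conf)

      # A trivial distributed job: if this prints 499500, tasks ran on the cluster.
      print(sc.parallelize(range(1000)).sum())
      sc.stop()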

Option 2. Spark Standalone Mode

Installation and Deployment: build each Docker image and run each on a separate dedicated host.

Tip: Build a common/shared host image with all necessary configurations and pre-built containers, which you can then use to deploy each node. When starting each node, you can pass the container run scripts as user data to initialize the container at boot time.
  1. Prerequisites

    • Deploy a Hadoop/HDFS cluster. Spark uses a cluster to distribute analysis of data pulled from multiple sources, including the Hadoop Distributed File System (HDFS). The ephemeral nature of Docker containers makes them ill-suited for persisting long-term data in a cluster. Instead of attempting to store data within the Docker containers' HDFS nodes or mounting host volumes, it is recommended you point this cluster at an external Hadoop deployment (see the sketch after this item). Cloudera provides complete resources for installing and configuring its distribution (CDH) of Hadoop. This repo has been tested with CDH5.
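
    For a concrete sense of what pointing at an external Hadoop deployment looks like from the analysis side, the sketch below reads a file from an external HDFS namenode in PySpark. The namenode hostname, port (8020 is the CDH5 default), and path are hypothetical placeholders.

      # Hedged illustration: reading from an external HDFS cluster in PySpark.
      # 'namenode-fqdn' and the path are placeholders for your deployment.
      from pyspark import SparkContext

      sc = SparkContext(appName="hdfs-read-example")  # master comes from the environment's Spark config
      lines = sc.textFile("hdfs://namenode-fqdn:8020/user/analyst/sample.txt")
      print(lines.count())  # data lives on the external cluster; the containers stay stateless
      sc.stop()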

  2. Build and configure hosts

    1. Install Docker v1.5+, the jq JSON processor, and iptables. For example, on an Ubuntu host:

      ./0-prepare-host.sh

    2. Update the Hadoop configuration files in runtime/cdh5/<hadoop|hive>/<multiple-files> with the correct hostnames for your Hadoop cluster. Use grep FIXME -R . to find hostnames to change.

    3. Generate a new SSH keypair (dockerfiles/base/lab41/spark-base/config/ssh/id_rsa and dockerfiles/base/lab41/spark-base/config/ssh/id_rsa.pub), adding the public key to dockerfiles/base/lab41/spark-base/config/ssh/authorized_keys.

    4. (optional) Update the SPARK_WORKER_CONFIG environment variable for Spark-specific options such as executor cores, either via a shell export command or by editing dockerfiles/standalone/lab41/spark-client-ipython/config/service/ipython/run.

    5. (optional) Comment out any unwanted Python packages in the base Dockerfile image dockerfiles/base/lab41/python-datatools/Dockerfile.

    6. Get Docker images:

      Option A: pull from Docker Hub:

        docker pull lab41/spark-master
        docker pull lab41/spark-worker
        docker pull lab41/spark-client-ipython

      Option B: build from scratch yourself:

        ./1-build.sh

      If you are creating common/shared host images, this is the point to snapshot the host image for replication.

  3. Deploy cluster nodes

    Ensure each host has a fully qualified domain name (e.g. master.domain.com, worker1.domain.com, ipython.domain.com) so the Spark nodes can properly associate.

    1. Run the master container on the master host:

      ./2-run-spark-master.sh

    2. Run worker container(s) on the worker host(s) (replace 'spark-master-fqdn' below):

      ./3-run-spark-worker.sh spark://spark-master-fqdn:7077

    3. Run the client container on the client host (replace 'spark-master-fqdn' below):

      ./4-run-spark-client-ipython.sh spark://spark-master-fqdn:7077
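
    As with the Mesos path, a quick PySpark job from the IPython client verifies the standalone cluster end to end. This is a sketch assuming a stock PySpark environment, with 'spark-master-fqdn' standing in as in the steps above.

      # Hedged verification of the standalone cluster from the IPython client
      from pyspark import SparkContext

      sc = SparkContext("spark://spark-master-fqdn:7077", "standalone-smoke-test")
      print(sc.parallelize(range(100)).map(lambda x: x * x).sum())  # expect 328350
      sc.stop()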

More Repositories

  1. sunny-side-up: Sentiment Analysis Challenge (Jupyter Notebook, 521 stars)
  2. PySEAL: a fork of Microsoft Research's homomorphic encryption implementation, the Simple Encrypted Arithmetic Library (SEAL); wraps the SEAL build in a Docker container and provides Python APIs to the encryption library (C++, 225 stars)
  3. hermes: Recommender System Framework (Jupyter Notebook, 124 stars)
  4. cyphercat: implementation of membership inference and model inversion attacks, extracting training data information from an ML model; benchmarking attacks and defenses (Jupyter Notebook, 98 stars)
  5. Dendrite: People. Places. Things. Graphs. (JavaScript, 92 stars)
  6. attalos: Joint Vector Spaces (Jupyter Notebook, 89 stars)
  7. Circulo: Community Detection Research Effort (Python, 79 stars)
  8. pythia: supervised learning for novelty detection in text (Jupyter Notebook, 79 stars)
  9. survey-community-detection: Market Survey: Community Detection (70 stars)
  10. Magnolia (Jupyter Notebook, 45 stars)
  11. pelops: the Pelops car re-ID project (Jupyter Notebook, 44 stars)
  12. altair: Assessing Source Code Semantic Similarity with Unsupervised Learning (Python, 41 stars)
  13. SkyLine: An Exploration into Graph Databases (Python, 28 stars)
  14. soft-boiled: library for geo-inferencing in Twitter data (Python, 28 stars)
  15. magichour: security log file challenge (Jupyter Notebook, 28 stars)
  16. Redwood: statistical methods for identifying anomalous files (Python, 22 stars)
  17. gestalt: data storytelling; see http://lab41.github.io/gestalt for detailed documentation (JavaScript, 20 stars)
  18. d-script: Writer Identification of Handwritten Documents (Jupyter Notebook, 13 stars)
  19. Misc: miscellaneous utility functions (Jupyter Notebook, 11 stars)
  20. graph-generators: scripts for generating graphs in various formats (Python, 11 stars)
  21. lab41.github.com: Lab41 Blog (HTML, 10 stars)
  22. etl-by-example (Java, 10 stars)
  23. VOiCES-subset (Jupyter Notebook, 9 stars)
  24. MRKronecker (Java, 7 stars)
  25. graphlab-twill (Java, 7 stars)
  26. Hemlock: a way of providing a common data access layer (JavaScript, 7 stars)
  27. try41: a demonstration platform (CSS, 7 stars)
  28. Summer2018ML: educates users on the basics of machine learning, from basic linear algebra to backward propagation (Jupyter Notebook, 7 stars)
  29. Rio: gephi <3 blueprints (Java, 7 stars)
  30. Hemlock-Frontend: Rails frontend for Hemlock (Ruby, 4 stars)
  31. ganymede_nbextension: Ganymede logging extension for the Jupyter Notebook server (Python, 4 stars)
  32. verboten_words: pre-commit hook that searches for words you do not want in your repo (Python, 3 stars)
  33. Hemlock-REST: RESTful server for Lab41/Hemlock (Python, 3 stars)
  34. Epiphyte: code for bulk loading data into Titan (Java, 3 stars)
  35. Blogs: code relevant to our blog posts (MATLAB, 2 stars)
  36. Papers: Lab41 submitted academic papers (2 stars)
  37. nbhub (Python, 2 stars)
  38. hadoop-dev-env (Shell, 1 star)
  39. mediumblog (1 star)
  40. reading-group-generation-1: reading group summaries and resources (1 star)
  41. condo: 🌇 simulated codon optimized CDS dataset (Jupyter Notebook, 1 star)
  42. titan-python-tutorial (Python, 1 star)