Apache Spark docker image

Spark docker

Docker images to:

  • Set up a standalone Apache Spark cluster running one Spark master and multiple Spark workers
  • Build Spark applications in Java, Scala or Python to run on a Spark cluster

Currently supported versions:

  • Spark 3.3.0 for Hadoop 3.3 with OpenJDK 8 and Scala 2.12
  • Spark 3.2.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.2.0 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.2 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.2 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.5 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.4 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.0 for Hadoop 2.8 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.8 with OpenJDK 8
  • Spark 2.3.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7
  • Spark 1.6.2 for Hadoop 2.6 and later
  • Spark 1.5.1 for Hadoop 2.6 and later

Using Docker Compose

Add the following services to your docker-compose.yml to integrate a Spark master, two Spark workers and a Spark history server in your BDE pipeline:

version: '3'
services:
  spark-master:
    image: bde2020/spark-master:3.3.0-hadoop3.3
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
  spark-worker-1:
    image: bde2020/spark-worker:3.3.0-hadoop3.3
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker:3.3.0-hadoop3.3
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-history-server:
    image: bde2020/spark-history-server:3.3.0-hadoop3.3
    container_name: spark-history-server
    depends_on:
      - spark-master
    ports:
      - "18081:18081"
    volumes:
      - /tmp/spark-events-local:/tmp/spark-events

Make sure to fill in the INIT_DAEMON_STEP as configured in your pipeline.
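
The two worker services above differ only in their name and in the host port mapped to the worker UI (8081 and 8082). As a minimal sketch, assuming you want further workers that follow the same pattern, a small shell function can print additional service entries to paste into docker-compose.yml. The function name worker_service and the convention of using host port 8080 + N are assumptions made for illustration; they are not part of the images.

```shell
#!/bin/sh
# Hypothetical helper: print a docker-compose service entry for worker N,
# following the pattern of spark-worker-1 and spark-worker-2 above.
# Assumption: host port 8080+N is free and is mapped to the worker UI port 8081.
worker_service() {
  n="$1"
  tag="${2:-3.3.0-hadoop3.3}"
  cat <<EOF
  spark-worker-${n}:
    image: bde2020/spark-worker:${tag}
    container_name: spark-worker-${n}
    depends_on:
      - spark-master
    ports:
      - "$((8080 + $n)):8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
EOF
}

# Example: print the entry for a third worker.
worker_service 3
```

The output is indented to sit directly under the services: key, next to the existing worker entries.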

Running Docker containers without the init daemon

Spark Master

To start a Spark master:

docker run --name spark-master -h spark-master -d bde2020/spark-master:3.3.0-hadoop3.3

Spark Worker

To start a Spark worker:

docker run --name spark-worker-1 --link spark-master:spark-master -d bde2020/spark-worker:3.3.0-hadoop3.3
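
To start a master together with several workers, the two commands above can be generated in a loop. The following is a sketch only: the function name spark_cluster_cmds and its arguments are hypothetical, and it merely prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print the docker run commands for one master and
# N workers, using the same flags as the commands above.
# Nothing is executed; pipe the output to sh to actually start the containers.
spark_cluster_cmds() {
  workers="${1:-2}"
  tag="${2:-3.3.0-hadoop3.3}"
  echo "docker run --name spark-master -h spark-master -d bde2020/spark-master:${tag}"
  i=1
  while [ "$i" -le "$workers" ]; do
    echo "docker run --name spark-worker-${i} --link spark-master:spark-master -d bde2020/spark-worker:${tag}"
    i=$(($i + 1))
  done
}

# Example: review the commands for a master and two workers.
spark_cluster_cmds 2
```

Once the printed commands look right, running spark_cluster_cmds 2 | sh executes them in order, so the master is started before the workers that link to it.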

Launch a Spark application

Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.

Kubernetes deployment

The BDE Spark images can also be used in a Kubernetes environment.

To deploy a simple Spark standalone cluster, run:

kubectl apply -f https://raw.githubusercontent.com/big-data-europe/docker-spark/master/k8s-spark-cluster.yaml

This sets up a Spark standalone cluster with one master and a worker on every available node, using the default namespace and resources. The master is reachable in the same namespace at spark://spark-master:7077. It also sets up a headless service so that Spark clients are reachable from the workers under the hostname spark-client.

Then, to use spark-shell, run:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.3.0-hadoop3.3 -- bash ./spark/bin/spark-shell --master spark://spark-master:7077 --conf spark.driver.host=spark-client

To use spark-submit, run for example:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.3.0-hadoop3.3 -- bash ./spark/bin/spark-submit --class CLASS_TO_RUN --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client URL_TO_YOUR_APP

You can use your own image packed with Spark and your application, but once deployed it must be reachable from the workers. One way to achieve this is to create a headless service for your pod and then pass --conf spark.driver.host=YOUR_HEADLESS_SERVICE whenever you submit your application.
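
A headless service is a Kubernetes Service with clusterIP set to None, so its DNS name resolves directly to the pods it selects. As a minimal sketch, the function below prints such a manifest for a driver pod. The function name headless_service, the example name my-spark-driver and the assumption that your pod carries the label app=NAME are all illustrative and not part of the BDE images.

```shell
#!/bin/sh
# Hypothetical helper: print a headless Service manifest that selects
# pods labelled app=NAME. Assumption: your driver pod carries that label.
headless_service() {
  name="$1"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${name}
spec:
  clusterIP: None
  selector:
    app: ${name}
EOF
}

# Example: print the manifest for a driver pod labelled app=my-spark-driver.
headless_service my-spark-driver
```

Piping the output to kubectl apply -f - creates the service. The application can then be submitted with --conf spark.driver.host=my-spark-driver from a pod labelled app=my-spark-driver.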
