  • Stars: 156
  • Rank: 239,589 (Top 5%)
  • Language: Java
  • License: Apache License 2.0
  • Created: over 6 years ago
  • Updated: about 3 years ago


Repository Details

Operator for managing the Spark clusters on Kubernetes and OpenShift.

spark-operator


{CRD|ConfigMap}-based approach for managing the Spark clusters in Kubernetes and OpenShift.

Watch the full asciicast

How does it work

UML diagram of the operator's architecture (image available in the repository README).

Quick Start

Run the spark-operator deployment (remember to change the namespace variable for the ClusterRoleBinding before doing this step):

kubectl apply -f manifest/operator.yaml

Create new cluster from the prepared example:

kubectl apply -f examples/cluster.yaml

After issuing the commands above, you should be able to see a new Spark cluster running in the current namespace.

kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
my-spark-cluster-m-5kjtj           1/1       Running   0          10s
my-spark-cluster-w-m8knz           1/1       Running   0          10s
my-spark-cluster-w-vg9k2           1/1       Running   0          10s
spark-operator-510388731-852b2     1/1       Running   0          27s

Once you no longer need the cluster, you can delete it by deleting the custom resource:

kubectl delete sparkcluster my-spark-cluster

Very Quick Start

# create operator
kubectl apply -f http://bit.ly/sparkop

# create cluster
cat <<EOF | kubectl apply -f -
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: my-cluster
spec:
  worker:
    instances: "2"
EOF

Limits and requests for cpu and memory in SparkCluster pods

The operator supports multiple fields for setting limit and request values for master and worker pods. You can see these being used in the examples/test directory; a short sketch also follows the list below.

  • cpu and memory specify both the limit and request values for cpu and memory (that is, limits and requests will be equal). This was the first mechanism provided for setting limits and requests and has been retained for backward compatibility; however, a need was found to be able to set the requests and limits individually.

  • cpuRequest and memoryRequest set request values and take precedence over the values from cpu and memory, respectively.

  • cpuLimit and memoryLimit set limit values and take precedence over the values from cpu and memory, respectively.
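
Putting these fields together, a SparkCluster manifest might look like the following sketch. The field names come from the list above, but their exact placement under master and worker is an assumption here; the manifests under examples/test show the authoritative layout.

apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: my-sized-cluster
spec:
  master:
    instances: "1"
    cpu: "500m"              # request and limit both set to 500m
    memory: "512Mi"          # request and limit both set to 512Mi
  worker:
    instances: "2"
    cpu: "1"
    memoryRequest: "512Mi"   # request set individually...
    memoryLimit: "1Gi"       # ...with a higher limit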

Node Tolerations for SparkCluster pods

The operator supports specifying Kubernetes node tolerations, which will be applied to all master and worker pods in a Spark cluster. You can see examples of this in use in the examples/test directory; a short sketch also follows below.

  • nodeTolerations specifies a list of Node Toleration definitions that should be applied to all master and worker pods.
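
A cluster definition using this field might look like the following sketch. The toleration entries use the standard Kubernetes toleration fields, while the taint key and value are purely illustrative; the placement of nodeTolerations directly under spec is an assumption here, so consult examples/test for the exact schema.

apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: my-tolerant-cluster
spec:
  nodeTolerations:            # applied to every master and worker pod
    - key: "dedicated"        # hypothetical taint key, for illustration only
      operator: "Equal"
      value: "spark"
      effect: "NoSchedule"
  worker:
    instances: "2"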

Spark Applications

Apart from managing clusters with Apache Spark, this operator can also manage Spark applications, similarly to the GoogleCloudPlatform/spark-on-k8s-operator. These applications spawn their own Spark cluster for their needs, using Kubernetes as the native scheduling mechanism for Spark. For more details, consult the Spark docs.

# create spark application
cat <<EOF | kubectl apply -f -
apiVersion: radanalytics.io/v1
kind: SparkApplication
metadata:
  name: my-cluster
spec:
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  mainClass: org.apache.spark.examples.SparkPi
EOF

OpenShift

For deployment on OpenShift, use the same commands as above (with oc instead of kubectl if kubectl is not installed) and make sure the logged-in user can create CRDs: oc login -u system:admin && oc project default
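
For example, a minimal sketch of the OpenShift workflow, using the login and project from the line above, might be:

oc login -u system:admin
oc project default
oc apply -f manifest/operator.yaml
oc apply -f examples/cluster.yaml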

Config Map approach

This operator can also work with ConfigMaps instead of CRDs. This can be useful in situations where the user is not allowed to create CRDs or ClusterRoleBinding resources. The schema for the config maps is almost identical to the custom resources, and you can check the examples.

kubectl apply -f manifest/operator-cm.yaml

The manifest above is almost the same as operator.yaml. If the environment variable CRD is set to false, the operator will watch config maps with certain labels.

You can then create the Spark clusters as usual by creating the config map (CM).

kubectl apply -f examples/cluster-cm.yaml
kubectl get cm -l radanalytics.io/kind=SparkCluster

or Spark applications that are natively scheduled on Spark clusters by:

kubectl apply -f examples/test/cm/app.yaml
kubectl get cm -l radanalytics.io/kind=SparkApplication
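
For reference, a ConfigMap-based cluster definition might look like the following sketch. The label is the one the operator watches for, as used in the kubectl get commands above; the data key name (config) is an assumption here, so check examples/cluster-cm.yaml for the exact layout.

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-spark-cluster
  labels:
    radanalytics.io/kind: SparkCluster   # label the operator watches when CRD=false
data:
  config: |-                             # key name assumed; see examples/cluster-cm.yaml
    worker:
      instances: "2"
    master:
      instances: "1"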

Images

Image name         Description
:latest-released   the latest released version (published to quay.io and docker.io)
:latest            the master branch
:x.y.z             one particular released version

For each variant there is also an image with the -alpine suffix, based on Alpine Linux.
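
As an illustration, pulling the released operator image might look like this; the quay.io path follows the organization and repository name, so double-check it against the image links in the original README.

docker pull quay.io/radanalyticsio/spark-operator:latest-released
# an Alpine-based variant is also published for each tag (see the note above)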

Configuring the operator

The spark-operator contains several defaults that are implicit to the creation of Spark clusters and applications. Here is a list of environment variables that can be set to adjust the operator's default behavior.

  • CRD set to true if the operator should respond to Custom Resources, and set to false if it should respond to ConfigMaps.
  • DEFAULT_SPARK_CLUSTER_IMAGE a container image reference that will be used as a default for all pods in a SparkCluster deployment when the image is not specified in the cluster manifest.
  • DEFAULT_SPARK_APP_IMAGE a container image reference that will be used as a default for all executor pods in a SparkApplication deployment when the image is not specified in the application manifest.

Please note that these environment variables must be set in the operator's container; see operator.yaml and operator-cm.yaml for operator deployment information.
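
A sketch of how this looks inside the operator Deployment follows. The container name and the default-image values are illustrative only; refer to manifest/operator.yaml for the real manifest.

# excerpt from a Deployment spec; values shown are illustrative
spec:
  template:
    spec:
      containers:
        - name: spark-operator
          image: quay.io/radanalyticsio/spark-operator:latest-released
          env:
            - name: CRD
              value: "true"                  # respond to Custom Resources
            - name: DEFAULT_SPARK_CLUSTER_IMAGE
              value: "quay.io/radanalyticsio/openshift-spark:latest"   # hypothetical default image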

Related projects

If you are looking for tooling to make interacting with the spark-operator more convenient, please see the following.

  • Ansible role is a simple way to deploy the Spark operator using the Ansible ecosystem. The role is also available on Ansible Galaxy.

  • oshinko-temaki is a shell application for generating SparkCluster manifest definitions. It can produce full schema manifests from a few simple command line flags.

For checking and verifying that your own container image will work smoothly with the operator, use the following tool.

  • soit is a CLI tool that runs a set of tests against a given image to verify that it contains the right files on the file system, that a worker can register with the master, etc. Check the code in the repository.

The radanalyticsio/spark-operator is not the only Kubernetes operator service that targets Apache Spark.

  • GoogleCloudPlatform/spark-on-k8s-operator is an operator which shares a similar schema for the Spark cluster and application resources. One major difference between it and the radanalyticsio/spark-operator is that the latter has been designed to work well in environments where a user has limited role-based access to Kubernetes, such as OpenShift, and that radanalyticsio/spark-operator can also deploy standalone Spark clusters.

Operator Marketplace

If you would like to install the operator into OpenShift (version 4.1 and newer) using the Operator Marketplace, simply run:

cat <<EOF | kubectl apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorSource
metadata:
  name: radanalyticsio-operators
  namespace: openshift-marketplace
spec:
  type: appregistry
  endpoint: https://quay.io/cnr
  registryNamespace: radanalyticsio
  displayName: "Operators from radanalytics.io"
  publisher: "Jirka Kremser"
EOF

You will find the operator in the OpenShift web console under Catalog > OperatorHub (make sure the namespace is set to openshift-marketplace).

Troubleshooting

Show the log:

# last 25 log entries
kubectl logs --tail 25 -l app.kubernetes.io/name=spark-operator
# follow logs
kubectl logs -f $(kubectl get pod -l app.kubernetes.io/name=spark-operator -o jsonpath='{.items[0].metadata.name}')

Run the operator from your host (also possible with the debugger/profiler):

java -jar target/spark-operator-*.jar
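
If the jar has not been built yet, a standard Maven build (assumed from the project's Java/Maven layout) should produce it first:

mvn clean package -DskipTests
java -jar target/spark-operator-*.jar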

More Repositories

  • openshift-spark (Shell, 72 stars)
  • silex: something to help you spark (Scala, 65 stars)
  • oshinko-webui: Web console for a spark cluster management app (JavaScript, 28 stars)
  • streaming-amqp: AMQP data source for dstream (Spark Streaming) (Scala, 26 stars)
  • oshinko-s2i: A place to put s2i images and utilities for spark application builders for OpenShift (Shell, 15 stars)
  • streaming-lab: Artifacts and resources to support the streaming and event processing labs for radanalytics.io (Jupyter Notebook, 14 stars)
  • oshinko-cli: Command line interface for the spark cluster management app (Go, 11 stars)
  • radanalyticsio.github.io: A developer oriented site for the radanalytics organization (CSS, 10 stars)
  • tutorial-sparkpi-java-spring: A Java implementation of SparkPi using Spring Boot (Java, 9 stars)
  • scorpion-stare: Spark scheduler backend plug-ins for awareness of kube, openshift, oshinko, etc. (Scala, 8 stars)
  • workshop: Materials for a workshop on deploying intelligent applications on OpenShift (Jupyter Notebook, 7 stars)
  • oshinko-console: Oshinko Console Extensions (JavaScript, 7 stars)
  • tensorflow-build-s2i: S2I image for building tensorflow binaries (Shell, 5 stars)
  • tutorial-sparkpi-scala-akka (Scala, 5 stars)
  • workshop-notebook: Basic Jupyter notebook for learning Spark and OpenShift (Jupyter Notebook, 5 stars)
  • oshinko-rest: REST API for the spark cluster management app (Go, 5 stars)
  • base-notebook: An image for running Jupyter notebooks and Apache Spark in the cloud on OpenShift (Shell, 4 stars)
  • tensorflow-serving-s2i: S2I image for running tensorflow model server on OpenShift (Shell, 4 stars)
  • oshinko-specs: A place to keep specifications for features to be implemented (4 stars)
  • var-sandbox: Value at Risk sandbox scaffolding application (JavaScript, 4 stars)
  • bad-apples (Java, 3 stars)
  • tensorflow-serving-gpu-s2i: S2I image for running tensorflow_model_server on OpenShift with GPU (Shell, 3 stars)
  • oshinko-core: Standalone component for building oshinko spark clusters (Go, 3 stars)
  • pyspark-s3-notebook: A simple Jupyter notebook to demonstrate techniques for connecting your application to data in s3 (Jupyter Notebook, 3 stars)
  • openshift-analytics-ansible (Shell, 3 stars)
  • tutorial-sparkpi-python-flask: A Python implementation of SparkPi using Flask (Python, 2 stars)
  • grafzahl (HTML, 2 stars)
  • tutorial-sparkpi-java-vertx (Java, 2 stars)
  • oshinko-oshizushi: An operator to wrap oshinko s2i mechanics (Dockerfile, 2 stars)
  • jiminy-predictor: A predictor service for a spark based recommendation app (Python, 2 stars)
  • oc-proxy: Runs 'oc proxy' in a container (Makefile, 1 star)
  • oshinko-broker: OpenServiceBrokerAPI implementation of the Oshinko service (1 star)
  • tensorflow-neural-style-gpu-s2i: An S2I builder for neural-style training and painting with Tensorflow on GPU (Shell, 1 star)
  • kubespark-operator (Go, 1 star)
  • FinancialDataAnalysis: Python notebooks showing the analysis of financial data sets (Jupyter Notebook, 1 star)
  • jiminy-modeler: A framework for a spark based model creation service (Python, 1 star)
  • winemap-data-loader: A Python s2i that loads the wine review data into a postgresql pod on OpenShift (Python, 1 star)
  • tensorflow-neural-style-s2i: An S2I builder for neural-style training and painting with Tensorflow (Shell, 1 star)
  • equoid-openshift (Shell, 1 star)
  • openshift-test-kit (Shell, 1 star)
  • winemap: The app that calls the postgresql db to show a map of wine reviews (Shell, 1 star)