spark-operator
{CRD|ConfigMap}-based approach for managing Spark clusters in Kubernetes and OpenShift.
How does it work
Quick Start
Run the spark-operator deployment. Remember to change the namespace variable for the ClusterRoleBinding before doing this step:
kubectl apply -f manifest/operator.yaml
Create a new cluster from the prepared example:
kubectl apply -f examples/cluster.yaml
After issuing the commands above, you should be able to see a new Spark cluster running in the current namespace.
kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
my-spark-cluster-m-5kjtj           1/1     Running   0          10s
my-spark-cluster-w-m8knz           1/1     Running   0          10s
my-spark-cluster-w-vg9k2           1/1     Running   0          10s
spark-operator-510388731-852b2     1/1     Running   0          27s
Once you no longer need the cluster, you can remove it by deleting the custom resource:
kubectl delete sparkcluster my-spark-cluster
Very Quick Start
# create operator
kubectl apply -f http://bit.ly/sparkop
# create cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: SparkCluster
metadata:
  name: my-cluster
spec:
  worker:
    instances: "2"
EOF
Limits and requests for cpu and memory in SparkCluster pods
The operator supports multiple fields for setting limit and request values for master and worker pods. You can see these being used in the examples/test directory.
- cpu and memory specify both limit and request values for cpu and memory (that is, limits and requests will be equal). This was the first mechanism provided for setting limits and requests and has been retained for backward compatibility. However, requests and limits sometimes need to be set individually.
- cpuRequest and memoryRequest set request values and take precedence over values from cpu and memory, respectively.
- cpuLimit and memoryLimit set limit values and take precedence over values from cpu and memory, respectively.
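The fields listed above could be combined in one manifest roughly as follows. This is a minimal sketch, not a copy of the files in examples/test: the function name spark_cluster_with_limits is hypothetical, the placement of the resource fields under spec.master and spec.worker (next to instances) is an assumption, the values are illustrative, and apiVersion is copied from the Very Quick Start example.

```shell
# Hypothetical helper: print a SparkCluster manifest that sets
# requests and limits for master and worker pods.
# Assumption: resource fields sit under spec.master and spec.worker.
spark_cluster_with_limits() {
  cat <<'EOF'
apiVersion: v1
kind: SparkCluster
metadata:
  name: my-cluster
spec:
  master:
    # cpu and memory set both the request and the limit to the same value
    cpu: "1"
    memory: "1Gi"
  worker:
    instances: "2"
    # requests and limits set individually; these take precedence
    # over cpu and memory if those are also present
    cpuRequest: "500m"
    cpuLimit: "1"
    memoryRequest: "512Mi"
    memoryLimit: "1Gi"
EOF
}

# Print the manifest for review.
spark_cluster_with_limits
```

After reviewing the output, it can be piped to kubectl apply -f - in the same way as the examples above.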
Node Tolerations for SparkCluster pods
The operator supports specifying Kubernetes node tolerations which will be applied to all master and worker pods in a Spark cluster. You can see examples of this in use in the examples/test directory.
- nodeTolerations specifies a list of node toleration definitions that are applied to all master and worker pods.
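A manifest using nodeTolerations might look like the sketch below. Several things here are assumptions rather than facts taken from this README: the function name is hypothetical, nodeTolerations is placed directly under spec, each entry is assumed to follow the standard Kubernetes toleration shape (key, operator, value, effect), and the dedicated=spark taint is made up for illustration. The files in examples/test remain the authoritative reference.

```shell
# Hypothetical helper: print a SparkCluster manifest with one toleration.
# Assumptions: nodeTolerations lives directly under spec and its entries
# use the standard Kubernetes toleration fields.
spark_cluster_with_tolerations() {
  cat <<'EOF'
apiVersion: v1
kind: SparkCluster
metadata:
  name: my-cluster
spec:
  worker:
    instances: "2"
  nodeTolerations:
    # tolerate a hypothetical taint dedicated=spark:NoSchedule
    - key: dedicated
      operator: Equal
      value: spark
      effect: NoSchedule
EOF
}

# Print the manifest for review.
spark_cluster_with_tolerations
```

With a manifest like this, master and worker pods would be allowed onto nodes carrying the matching taint; pods without the toleration would stay off those nodes.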
Spark Applications
Apart from managing clusters with Apache Spark, this operator can also manage Spark applications, similarly to the GoogleCloudPlatform/spark-on-k8s-operator. These applications spawn their own Spark cluster for their needs and use Kubernetes as the native scheduling mechanism for Spark. For more details, consult the Spark docs.
# create spark application
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: SparkApplication
metadata:
  name: my-cluster
spec:
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  mainClass: org.apache.spark.examples.SparkPi
EOF
OpenShift
For deployment on OpenShift, use the same commands as above (with oc instead of kubectl if kubectl is not installed) and make sure the logged-in user can create CRDs: oc login -u system:admin && oc project default
Config Map approach
This operator can also work with Config Maps instead of CRDs. This can be useful when the user is not allowed to create CRDs or ClusterRoleBinding resources. The schema for config maps is almost identical to that of the custom resources; see the examples.
kubectl apply -f manifest/operator-cm.yaml
The manifest above is almost the same as operator.yaml. If the environment variable CRD is set to false, the operator will watch for config maps with certain labels.
You can then create the Spark clusters as usual by creating the config map (CM).
kubectl apply -f examples/cluster-cm.yaml
kubectl get cm -l radanalytics.io/kind=SparkCluster
or Spark applications that are natively scheduled on Spark clusters by:
kubectl apply -f examples/test/cm/app.yaml
kubectl get cm -l radanalytics.io/kind=SparkApplication
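For orientation, a config map describing a cluster could look roughly like the sketch below. The radanalytics.io/kind label is taken from the kubectl get cm commands above; the function name is hypothetical, and the data key named config holding the cluster spec is an assumption that should be checked against examples/cluster-cm.yaml.

```shell
# Hypothetical helper: print a ConfigMap that describes a Spark cluster.
# Assumption: the cluster spec is stored under a data key named "config".
spark_cluster_configmap() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-spark-cluster
  labels:
    # label the operator watches for when CRD is set to false
    radanalytics.io/kind: SparkCluster
data:
  config: |-
    worker:
      instances: "2"
EOF
}

# Print the manifest for review.
spark_cluster_configmap
```

Because the operator selects config maps by label, a config map without the radanalytics.io/kind label would not show up in the kubectl get cm -l queries above.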
Images
Image name | Description | Layers | quay.io | docker.io |
---|---|---|---|---|
:latest-released | represents the latest released version | | | |
:latest | represents the master branch | | | |
:x.y.z | one particular released version | | | |
For each variant, an image with the -alpine suffix, based on Alpine, is also available.
Configuring the operator
The spark-operator contains several defaults that are implicit to the creation of Spark clusters and applications. Here is a list of environment variables that can be set to adjust the default behavior of the operator.
- CRD: set to true if the operator should respond to Custom Resources, and set to false if it should respond to ConfigMaps.
- DEFAULT_SPARK_CLUSTER_IMAGE: a container image reference that will be used as a default for all pods in a SparkCluster deployment when the image is not specified in the cluster manifest.
- DEFAULT_SPARK_APP_IMAGE: a container image reference that will be used as a default for all executor pods in a SparkApplication deployment when the image is not specified in the application manifest.
Please note that these environment variables must be set in the operator's container. See operator.yaml and operator-cm.yaml for operator deployment information.
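As an illustration, the env section of the operator's container could be written along these lines. This is a sketch only: the function name is hypothetical, the registry.example.com image references are placeholders rather than real images, and the fragment must be merged into the container definition found in operator.yaml or operator-cm.yaml.

```shell
# Hypothetical helper: print an env fragment for the operator's container.
# The image references below are placeholders, not real images.
operator_env_fragment() {
  cat <<'EOF'
env:
  # watch ConfigMaps instead of Custom Resources
  - name: CRD
    value: "false"
  # default image for all pods of a SparkCluster
  - name: DEFAULT_SPARK_CLUSTER_IMAGE
    value: "registry.example.com/my-spark:latest"
  # default image for executor pods of a SparkApplication
  - name: DEFAULT_SPARK_APP_IMAGE
    value: "registry.example.com/my-spark-app:latest"
EOF
}

# Print the fragment for review.
operator_env_fragment
```

The values are quoted so that YAML treats them as strings, which is what container environment variables require.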
Related projects
If you are looking for tooling to make interacting with the spark-operator more convenient, please see the following.
- Ansible role is a simple way to deploy the Spark operator using the Ansible ecosystem. The role is also available in Ansible Galaxy.
- oshinko-temaki is a shell application for generating SparkCluster manifest definitions. It can produce full schema manifests from a few simple command line flags.
To check that your own container image will work smoothly with the operator, use the following tool.
- soit is a CLI tool that runs a set of tests against the given image to verify if it contains the right files on the file system, if worker can register with master, etc. Check the code in the repository.
The radanalyticsio/spark-operator is not the only Kubernetes operator service that targets Apache Spark.
- GoogleCloudPlatform/spark-on-k8s-operator is an operator which shares a similar schema for the Spark cluster and application resources. One major difference between it and the radanalyticsio/spark-operator is that the latter has been designed to work well in environments where a user has limited role-based access to Kubernetes, such as on OpenShift, and also that radanalyticsio/spark-operator can deploy standalone Spark clusters.
Operator Marketplace
If you would like to install the operator into OpenShift (since 4.1) using the Operator Marketplace, simply run:
cat <<EOF | kubectl apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorSource
metadata:
  name: radanalyticsio-operators
  namespace: openshift-marketplace
spec:
  type: appregistry
  endpoint: https://quay.io/cnr
  registryNamespace: radanalyticsio
  displayName: "Operators from radanalytics.io"
  publisher: "Jirka Kremser"
EOF
You will find the operator in the OpenShift web console under Catalog > OperatorHub (make sure the namespace is set to openshift-marketplace).
Troubleshooting
Show the log:
# last 25 log entries
kubectl logs --tail 25 -l app.kubernetes.io/name=spark-operator
# follow logs
kubectl logs -f `kubectl get pod -l app.kubernetes.io/name=spark-operator -o='jsonpath="{.items[0].metadata.name}"' | sed 's/"//g'`
Run the operator from your host (also possible with the debugger/profiler):
java -jar target/spark-operator-*.jar