kube-openmpi: Open MPI jobs on Kubernetes

kube-openmpi mainly provides two things:

  • Kubernetes manifest templates (powered by Helm) to run Open MPI jobs on a Kubernetes cluster. See the chart directory for details.
  • Base Docker images on DockerHub for building your own custom Docker images. Currently only Ubuntu 16.04 based images are provided. To support distributed deep learning workloads, CUDA-based images are provided as well. Supported tags are listed below:

Supported tags of kube-openmpi base images

  • Plain Ubuntu based: 2.1.2-16.04-0.7.0 / 0.7.0
    • naming convention: $(OPENMPI_VERSION)-$(UBUNTU_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)
      • $(UBUNTU_IMAGE_TAG) refers to the tags of the official ubuntu image
  • Cuda (with cuDNN7) based:
    • cuda8.0: 2.1.2-8.0-cudnn7-devel-ubuntu16.04-0.7.0 / 0.7.0-cuda8.0
    • cuda9.0: 2.1.2-9.0-cudnn7-devel-ubuntu16.04-0.7.0 / 0.7.0-cuda9.0
    • cuda9.1: 2.1.2-9.1-cudnn7-devel-ubuntu16.04-0.7.0 / 0.7.0-cuda9.1
    • naming convention is $(OPENMPI_VERSION)-$(CUDA_IMAGE_TAG)-$(KUBE_OPENMPI_VERSION)
    • see Dockerfile
  • Chainer, Cupy, ChainerMN image:
    • cuda8.0: 0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0
    • cuda9.0: 0.7.0-cuda9.0-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0
    • cuda9.1: 0.7.0-cuda9.1-nccl2.1.15-1-chainer4.0.0b4-chainermn1.2.0
    • naming convention is $(KUBE_OPENMPI_VERSION)-$(CUDA_VERSION)-nccl$(NCCL_CUDA80_PACKAGE_VERSION)-chainer$(CHAINER_VERSION)-chainermn$(CHAINER_MN_VERSION)
    • see Dockerfile.chainermn
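
For example, you can pull one of the plain Ubuntu base images directly (the repository name everpeace/kube-openmpi and the tag come from the list above):

# pull the plain Ubuntu 16.04 base image with Open MPI 2.1.2
$ docker pull everpeace/kube-openmpi:0.7.0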

Quick Start

Requirements

Generate ssh keys and edit configuration

# generate temporary key
$ ./gen-ssh-key.sh

# edit your values.yaml
$ $EDITOR values.yaml
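
If you are curious, generating such a temporary key pair typically boils down to something like the following (an illustrative sketch of what gen-ssh-key.sh likely wraps; see the script itself for the authoritative steps):

# a passphrase-less RSA key pair for inter-pod ssh (illustrative only)
$ ssh-keygen -t rsa -b 4096 -N '' -f ./tmp-ssh-key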

Deploy

$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
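
Note that the manifests are rendered client-side with helm template and created directly with kubectl, so no Tiller installation is required. To check what was created (a generic sketch, not specific to this chart):

# the workers form a StatefulSet; the master is a single pod
$ kubectl -n $KUBE_NAMESPACE get statefulsets,pods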

Run

# wait until $MPI_CLUSTER_NAME-master is ready
$ kubectl get -n $KUBE_NAMESPACE po $MPI_CLUSTER_NAME-master

# You can run mpiexec now via 'kubectl exec'!
# the hostfile is automatically generated at '/kube-openmpi/generated/hostfile'
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
  --hostfile /kube-openmpi/generated/hostfile \
  --display-map -n 4 -npernode 1 \
  sh -c 'echo $(hostname):hello'
 Data for JOB [43686,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: MPI_CLUSTER_NAME-worker-0        Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [43686,1] App: 0 Process rank: 0 Bound: UNBOUND

 Data for node: MPI_CLUSTER_NAME-worker-1        Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [43686,1] App: 0 Process rank: 1 Bound: UNBOUND

 Data for node: MPI_CLUSTER_NAME-worker-2        Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [43686,1] App: 0 Process rank: 2 Bound: UNBOUND

 Data for node: MPI_CLUSTER_NAME-worker-3        Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [43686,1] App: 0 Process rank: 3 Bound: UNBOUND

 =============================================================
MPI_CLUSTER_NAME-worker-1:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-3:hello

Scale Up/Down your cluster

MPI workers form a StatefulSet, so you can scale the cluster up or down.

# scale workers from 4 to 3
$ kubectl -n $KUBE_NAMESPACE scale statefulsets $MPI_CLUSTER_NAME-worker --replicas=3
statefulset "MPI_CLUSTER_NAME-worker" scaled

# Then you can mpiexec again
# by default, the hostfile is updated automatically every 15 seconds
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
  --hostfile /kube-openmpi/generated/hostfile \
  --display-map -n 3 -npernode 1 \
  sh -c 'echo $(hostname):hello'
...
MPI_CLUSTER_NAME-worker-0:hello
MPI_CLUSTER_NAME-worker-2:hello
MPI_CLUSTER_NAME-worker-1:hello
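
Because the hostfile at /kube-openmpi/generated/hostfile is regenerated periodically, you can confirm that it reflects the new worker count (a minimal check):

# list the hosts currently visible to mpiexec
$ kubectl -n $KUBE_NAMESPACE exec $MPI_CLUSTER_NAME-master -- \
  cat /kube-openmpi/generated/hostfile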

Tear Down

$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE delete -f -

Use your own custom docker image

Please edit the image section in your values.yaml:

image:
  repository: yourname/kube-openmpi-based-custom-image
  tag: latest

This assumes that your custom image is based on our base image (everpeace/kube-openmpi) and does NOT change any of the ssh/sshd configurations defined in image/Dockerfile.

Please refer to Custom ChainerMN image example on kube-openmpi for details.
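
A minimal sketch of building and publishing such a custom image (the image name is the placeholder used above), assuming your Dockerfile starts FROM one of the base images listed earlier:

# Dockerfile begins with: FROM everpeace/kube-openmpi:0.7.0
# (keep the ssh/sshd configuration from image/Dockerfile intact)
$ docker build -t yourname/kube-openmpi-based-custom-image:latest .
$ docker push yourname/kube-openmpi-based-custom-image:latest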

Pull an image from Private Registry

Please create a Secret of docker-registry type in your namespace by referring here.
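
For reference, creating such a secret with kubectl looks like this (a sketch; the registry coordinates are placeholders):

$ kubectl -n $KUBE_NAMESPACE create secret docker-registry <docker_registry_secret_name> \
  --docker-server=<your_registry> \
  --docker-username=<username> \
  --docker-password=<password>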

And then, you can specify the secret name in your values.yaml:

image:
  repository: <your_registry>/<your_org>/<your_image_name>
  tag: <your_tag>
  pullSecrets:
  - name: <docker_registry_secret_name>

Inject your code to your containers from Github

kube-openmpi supports syncing your code hosted on GitHub into your containers. To do so, edit the appCodesToSync section in values.yaml. You can define multiple GitHub repositories.

appCodesToSync:
- name: your-app-name
  gitRepo: https://github.com/org/your-app-name.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: /repo

When Your Code Is in a Private Repository

When your code is in a private Git repository, the repository must be accessible via SSH.

Please remember that this feature requires securityContext.runAs: 0 for the side-car containers that fetch your code into the MPI containers.

Step 1.

You need to register an SSH key with the repository. Setting up Deploy Keys for your private repository is recommended because a deploy key is valid only for the target repository and is read-only.

Step 2.

Create a generic-type Secret with a key ssh whose value is the private key:

$ kubectl create -n $KUBE_NAMESPACE secret generic <git-sync-cred-name> --from-file=ssh=<deploy-private-key-file>
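
You can verify that the secret exists and contains the ssh key (a generic check):

# the Data section should list a key named 'ssh'
$ kubectl -n $KUBE_NAMESPACE describe secret <git-sync-cred-name>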

Step 3.

Then you can define appCodesToSync entries that reference the secret:

- name: <your-secret-repo>
  gitRepo: git@<git-server>:<your-org>/<your-secret-repo>.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: <mount-point>
  gitSecretName: <git-sync-cred-name>

Run kube-openmpi cluster as non-root user

By default, kube-openmpi runs your MPI cluster as the root user. From a security standpoint, however, you might want to run your MPI cluster as a non-root user. There are two ways to achieve this.

Use default openmpi user and group

kube-openmpi base Docker images on DockerHub ship with a normal user openmpi with uid=1000/gid=1000. To run your MPI cluster as this user, edit your values.yaml to specify a securityContext like below:

# values.yaml
...
mpiMaster:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
...
mpiWorkers:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000

Then you can run mpiexec as the openmpi user. If you already had a kube-openmpi cluster deployed, you need to tear it down and re-deploy it.

$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec \
  --hostfile /kube-openmpi/generated/hostfile \
  --display-map -n 4 -npernode 1 \
  sh -c 'echo $(hostname):hello'
...
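
To confirm that commands now run as the openmpi user rather than root, a quick check (a sketch):

# expect: uid=1000(openmpi) gid=1000(openmpi) ...
$ kubectl -n $KUBE_NAMESPACE exec $MPI_CLUSTER_NAME-master -- id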

Use your own custom user with custom uid/gid

You need to build your own custom base image because a custom user with your desired uid/gid must exist (be embedded) in the Docker image. To do this, just run make with the options below:

$ cd images
$ make REPOSITORY=<your_org>/<your_repo> SSH_USER=<username> SSH_UID=<uid> SSH_GID=<gid>

This builds the Ubuntu-based image as well as the cuda8 (cudnn7) and cuda9 (cudnn7) images.
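
For example, to embed a hypothetical user mpiuser with uid/gid 2000 (all values below are placeholders):

$ cd images
$ make REPOSITORY=yourorg/kube-openmpi-custom SSH_USER=mpiuser SSH_UID=2000 SSH_GID=2000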

And then, set the image in your values.yaml and set your uid/gid in runAsUser/fsGroup as in the previous section.

How to use gang-scheduling (i.e. schedule a group of pods at once)

As stated in kubeflow/tf-operator#165, spawning multiple kube-openmpi clusters can cause deadlock. To prevent this, you might want gang scheduling (i.e. scheduling multiple pods all together) in Kubernetes. Currently, kubernetes-incubator/kube-arbitrator supports it with the kube-batchd scheduler and a PodDisruptionBudget.

Please follow the steps:

  1. Deploy the kube-batchd scheduler.

  2. Edit the mpiWorkers.customScheduling section in your values.yaml like this:

    mpiWorkers:
      customScheduling:
        enabled: true
        schedulerName: <your_kube-batchd_scheduler_name>
        podDisruptionBudget:
          enabled: true
    
  3. Deploy your kube-openmpi cluster.
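
After deployment, you can verify the setup (generic checks; the jsonpath query simply prints the scheduler assigned to a worker pod):

# confirm the PodDisruptionBudget was created for the workers
$ kubectl -n $KUBE_NAMESPACE get poddisruptionbudgets
# confirm a worker pod is assigned to the custom scheduler
$ kubectl -n $KUBE_NAMESPACE get pod $MPI_CLUSTER_NAME-worker-0 \
  -o jsonpath='{.spec.schedulerName}'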

Run ChainerMN Job

We publish a Chainer/ChainerMN (with CuPy and NCCL2) based image. Let's use it. In this example, we run the train_mnist example from the ChainerMN repo. If you want to build your own Docker image, please refer to Custom ChainerMN image example on kube-openmpi for details.

  1. Edit your values.yaml so that:
  • kube-openmpi uses the image,
  • 2 MPI workers are allocated with 1 GPU resource assigned to each, and
  • an appCodesToSync entry is added to run the train_mnist example with ChainerMN.
image:
  repository: everpeace/kube-openmpi
  tag: 0.7.0-cuda8.0-nccl2.1.4-1-chainer4.0.0b4-chainermn1.2.0
...
mpiWorkers:
  num: 2
  resources:
    limits:
      nvidia.com/gpu: 1
...
appCodesToSync:
- name: chainermn
  gitRepo: https://github.com/chainer/chainermn.git
  gitBranch: master
  fetchWaitSecond: "120"
  mountPath: /chainermn-examples
  subPath: chainermn/examples
...
  2. Deploy your kube-openmpi cluster
$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
$ helm template chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
  3. Run train_mnist with GPU
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
  --hostfile /kube-openmpi/generated/hostfile \
  --display-map -n 2 -npernode 1 \
  python3 /chainermn-examples/mnist/train_mnist.py -g
========================   JOB MAP   ========================

Data for node: MPI_CLUSTER_NAME-worker-0  Num slots: 8    Max slots: 0    Num procs: 1
       Process OMPI jobid: [28697,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]

Data for node: MPI_CLUSTER_NAME-worker-1  Num slots: 8    Max slots: 0    Num procs: 1
       Process OMPI jobid: [28697,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]

=============================================================
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
...
1           0.224002    0.102322              0.9335         0.9695                    17.1341
2           0.0733692   0.0672879             0.977967       0.9765                    24.7188
...
20          0.00531046  0.105093              0.998267       0.9799                    160.794

Release Notes

0.7.0

  • docker base images:
    • fixed init.sh so that non-root users no longer fail to run it
  • kubernetes manifests:
    • added the master pod to the compute nodes; Open MPI jobs can now run in the master pod, which enables single-node Open MPI jobs.

0.6.0

  • docker base images:
    • CMD was changed from start_sshd.sh to init.sh. When ONE_SHOT is true, init.sh executes the user command passed as arguments to init.sh right after sshd is up.
  • kubernetes manifests:
    • oneShot mode is supported, together with an automatic worker scale-down feature.
      • In mpiMaster.oneShot mode, mpiMaster.oneShot.command is executed automatically on the master once the cluster is up. If mpiMaster.oneShot.autoScaleDownWorkers is enabled and the command completes successfully (i.e. its return code is 0), the worker cluster is scaled down to 0.

0.5.3

  • docker base images
    • cuda9.0 support added.
    • ChainerMN images for each CUDA version (8.0, 9.0, 9.1)
  • kubernetes manifests:
    • supported a docker-registry Secret for pulling Docker images from a private Docker registry
    • supported fetching code from private Git repositories

0.5.2

  • kubernetes manifests:
    • To prevent a potential deadlock when scheduling multiple kube-openmpi clusters, gang scheduling (scheduling a group of pods all together) for MPI workers is now available via kube-batchd in kube-arbitrator.

0.5.1

  • kubernetes manifests:
    • support user defined volumes/volumeMounts
    • kube-openmpi managed volume names changed.
  • Documents
    • made the Run step simpler: changed to use kubectl exec -it -- mpiexec directly.

0.5.0

  • docker images:
    • root can ssh to both mpi-master and mpi-workers when containers run as root
  • kubernetes manifests:
    • the MPI cluster now runs as root by default
    • you can use the openmpi user as before by setting runAsUser/fsGroup in values.yaml
    • you don't need to dig a tunnel to use mpiexec command!
    • documented how to use your custom user with custom uid/gid

0.4.0

  • docker images:
    • added orte_keep_fqdn_hostnames=t to openmpi-mca-params.conf
  • kubernetes manifests:
    • now you don't need CustomPodDNS feature gate!!
    • bootstrap job was removed
    • hostfile-updater was introduced. Now you can scale your MPI cluster up/down dynamically!
      • It runs alongside mpi-master as a side-car container.
    • The auto-generated hostfile was moved to /kube-openmpi/generated/hostfile

0.3.0

  • docker images:
    • removed the s6-overlay init process and introduced a self-managed sshd script to support securityContext (e.g. securityContext.runAs) (#1).
  • kubernetes manifests:
    • supported custom securityContext (#1)
    • improved mpi-cluster cleanup process
    • fixed a broken network-policy manifest

0.2.0

  • docker images:
    • fixed the CUDA-aware Open MPI installation script; the build now ensures mca:mpi:base:param:mpi_built_with_cuda_support:value:true for CUDA-based images. You can NOT use Open MPI with CUDA on 0.1.0, so please use 0.2.0.
  • kubernetes manifests:
    • fixed a bug where resources in values.yaml were ignored.
    • workers can now resolve the master in DNS.

0.1.0

  • initial release

TODO

  • automate the process (create a kube-openmpi command?)
  • document chart parameters
  • add additional persistent volume claims
