• Stars
    1,403
  • Rank 33,517 (Top 0.7 %)
  • Language
    Go
  • License
    Apache License 2.0
  • Created over 5 years ago
  • Updated 11 months ago

Repository Details

GPU Sharing Scheduler for Kubernetes Cluster

GPU Sharing Scheduler Extender in Kubernetes

Overview

A growing number of data scientists run NVIDIA GPU-based inference tasks on Kubernetes. Several of these tasks can often run on the same NVIDIA GPU device, which increases GPU utilization. Sharing GPUs between pods is therefore an important challenge, and one the community is very interested in.

A GPU sharing solution is now available on native Kubernetes. It is built on the scheduler extender and device plugin mechanisms, so you can easily reuse it in your own Kubernetes cluster.

Prerequisites

  • Kubernetes 1.11+
  • golang 1.19+
  • NVIDIA drivers ~= 361.93
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • Docker configured with NVIDIA as the default runtime.

Design

For more details about the design of this project, please read this Design document.

Setup

You can follow this Installation Guide. If you are using Alibaba Cloud Kubernetes, please follow this doc to install with Helm Charts.

User Guide

You can check this User Guide.

Developing

Scheduler Extender

git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git && cd gpushare-scheduler-extender
make build-image

Device Plugin

git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git && cd gpushare-device-plugin
docker build -t cheyang/gpushare-device-plugin .

Kubectl Extension

  • golang > 1.10
mkdir -p $GOPATH/src/github.com/AliyunContainerService
cd $GOPATH/src/github.com/AliyunContainerService
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
cd gpushare-device-plugin
go build -o $GOPATH/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go

Demo

- Demo 1: Deploy multiple GPU-sharing pods and schedule them onto the same GPU device using bin packing

- Demo 2: Avoid GPU memory requests that fit at the node level but not at the GPU device level
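
The idea behind both demos can be illustrated with a minimal Go sketch. This is not the project's actual code: the Node type, its FreeMem field, the methods, and the GiB unit are all hypothetical names chosen for illustration. It assumes a simple best-fit rule, where a request goes to the device that fits it with the least memory left over, and a request is refused when no single device can hold it, even if the node's total free memory would be enough.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a GPU node: one entry per GPU
// device, holding the memory (assumed to be in GiB) still free on that device.
type Node struct {
	Name    string
	FreeMem []int
}

// TotalFree sums the free GPU memory across all devices on the node.
func (n Node) TotalFree() int {
	total := 0
	for _, m := range n.FreeMem {
		total += m
	}
	return total
}

// PickDevice returns the index of the device that can hold the request while
// having the least free memory (best fit), or -1 if no single device fits.
func (n Node) PickDevice(request int) int {
	best := -1
	for i, free := range n.FreeMem {
		if free < request {
			continue // this device alone cannot hold the request
		}
		if best == -1 || free < n.FreeMem[best] {
			best = i
		}
	}
	return best
}

// Allocate reserves the requested memory on the chosen device and reports
// the device index and whether the allocation succeeded.
func (n *Node) Allocate(request int) (int, bool) {
	idx := n.PickDevice(request)
	if idx < 0 {
		return -1, false
	}
	n.FreeMem[idx] -= request
	return idx, true
}

func main() {
	// Demo 1 idea: small requests are packed onto one device.
	packed := &Node{Name: "node-a", FreeMem: []int{8, 8}}
	for _, req := range []int{3, 3, 2} {
		idx, ok := packed.Allocate(req)
		fmt.Printf("request %d -> device %d (ok=%v), free=%v\n", req, idx, ok, packed.FreeMem)
	}

	// Demo 2 idea: the node has enough memory in total, but no device does.
	split := Node{Name: "node-b", FreeMem: []int{4, 4}}
	fmt.Printf("total free=%d, device for request 6: %d\n", split.TotalFree(), split.PickDevice(6))
}
```

In this sketch the three requests (3, 3 and 2) all land on device 0, leaving device 1 untouched, while the request for 6 on the second node is refused even though 8 is free in total. A node-level check alone would have accepted that request, which is the situation Demo 2 describes.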

Related Project

Roadmap

  • Integrate NVIDIA MPS as an option for isolation
  • Automated deployment for Kubernetes clusters deployed by kubeadm
  • Scheduler Extender High Availability
  • Generic solution for GPU, RDMA and other devices

Adopters

If you are interested in GPUShare and would like to share your experiences with others, you are warmly welcome to add your information to the ADOPTERS.md page. We will continuously discuss new requirements and feature designs with you in advance.

Acknowledgments

  • The GPU sharing solution is based on NVIDIA Docker2, and NVIDIA's GPU sharing design is our reference. The NVIDIA community is very supportive, and we are very grateful.

More Repositories

1

k8s-for-docker-desktop

Enable Kubernetes and Istio for Docker Desktop for Mac/Windows.
PowerShell
4,960
star
2

pouch

An Efficient Enterprise-class Container Engine
Go
4,626
star
3

log-pilot

Collect logs for docker containers
Go
1,429
star
4

kube-eventer

kube-eventer emits Kubernetes events to sinks
Go
1,000
star
5

image-syncer

Docker image synchronization tool for Docker Registry V2 based services
Go
874
star
6

DevOps

Continuous delivery for Alibaba Cloud Container Service
779
star
7

derrick

🐳 A tool to help you containerize applications in seconds
Go
685
star
8

terway

CNI plugin for Alibaba Cloud VPC/ENI
Go
550
star
9

gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Go
468
star
10

redis-cluster

HA Redis Cluster with Sentinel by Docker Compose
Shell
455
star
11

kubernetes-cronhpa-controller

⏰ kubernetes-cronhpa-controller is an HPA controller that scales your workload on a time-based schedule.
Go
443
star
12

docker-machine-driver-aliyunecs

Aliyun (Alibaba Cloud) ECS Driver of Docker Machine
Go
203
star
13

serverless-k8s-examples

Examples for Serverless Kubernetes on Alibaba Cloud - https://yq.aliyun.com/articles/591115
Go
158
star
14

ackdistro

Shell
122
star
15

flexvolume

FlexVolume plugin for Alibaba Cloud EBS/NAS/OSS, etc.
Go
109
star
16

jenkins-slaves

jenkins containerized slaves
Shell
107
star
17

alicloud-controller-manager

The official project has moved to https://github.com/kubernetes/cloud-provider-alibaba-cloud
Go
90
star
18

et-operator

Kubernetes Operator for AI and Bigdata Elastic Training
Go
84
star
19

sync-repo

Synchronize images from gcr.io, quay.io and Docker Hub to your Docker registry
Python
82
star
20

velero-plugin

Go
78
star
21

sgx-device-plugin

Kubernetes Device Plugin for Intel SGX
Go
67
star
22

alibaba-cloud-metrics-adapter

Kubernetes Custom Metrics API and External Metrics API for Alibaba Cloud
Go
55
star
23

maven-image

Maven Docker Image with Aliyun Mirror
Dockerfile
47
star
24

kubernetes-issues-solution

Kubernetes related issues solution
Shell
39
star
25

scaler

Java
39
star
26

roadmap

Product roadmap for Alibaba Cloud Container Services including ACK, ACR, ASK - Serverless K8S, ACK@Edge and ASM - Service Mesh
33
star
27

kube2ram

kube2ram provides different Alibaba Cloud RAM roles for pods running on ACK
Go
32
star
28

ack-ram-authenticator

Using Alibaba Cloud credentials to authenticate to a Kubernetes cluster
Go
31
star
29

alicloud-storage-provisioner

Alicloud Storage Provider for Kubernetes
Go
31
star
30

open-service-broker-alibabacloud

The Open Service Broker API implementation for Alibaba Cloud
Go
30
star
31

ack-image-builder

Custom Image Builder for ACK
Shell
28
star
32

ack-kms-plugin

KMS provider plugin for Alibaba Cloud
Go
27
star
33

spring-cloud-k8s-sample

This example demonstrates how to use AliCloud Container Service features to build a Spring Boot application that leverages Spring Cloud capabilities.
Java
24
star
34

jenkins-demo

Java
23
star
35

spot-instance-advisor

spot-instance-advisor is a command-line tool that finds the cheapest group of spot instance types.
Go
22
star
36

solution-blockchain-demo

Source code of the demo application and demo explorer for the Blockchain Solution of Alibaba Cloud Container Service
JavaScript
22
star
37

helm-acr

Alibaba Cloud's Helm plugin to push chart package to ChartMuseum.
Go
22
star
38

docker-jenkins

Jenkins Docker Image which can set proper permission for local host volume
Shell
20
star
39

gpu-analyzer

GPU analyzer for Kubernetes GPU clusters
Go
17
star
40

ai-starter

Shell
17
star
41

benchmark-for-spark

benchmark-for-spark
HCL
16
star
42

terway-qos

The QoS project is a cloud-native solution leveraging eBPF technology, designed to efficiently manage and optimize network traffic across diverse hybrid deployment scenarios.
C
15
star
43

ack-secret-manager

ACK Secret Manager allows you to use external secret management systems (*e.g.*, Alibaba Cloud Secrets Manager) to securely add secrets in Kubernetes.
Go
15
star
44

kubeflow-aliyun

Deploy Kubeflow on Alibaba Cloud
14
star
45

monitoring-sample

Shell
13
star
46

cluster-api-provider-alibabacloud

Go
11
star
47

kubernetes-ops-handbook

Ops handbook for common Kubernetes problems.
10
star
48

hello-servicemesh-grpc

gRPC demo for ServiceMesh
Shell
9
star
49

jenkins-cos

Aliyun-Container-Service-plugin
Java
9
star
50

jenkins-on-serverless

9
star
51

rust-wasm-4-envoy

Shell
8
star
52

kubectl-autoscaler-plugin

7
star
53

ubuntu-image

Official Ubuntu Docker image with Aliyun mirror
7
star
54

centos-image

Official CentOS Docker image with Aliyun mirror
6
star
55

ack-ram-tool

Go
6
star
56

asm-labs

Go
6
star
57

ai-models-on-ack

Examples of deploying AI applications on ACK
Makefile
6
star
58

kubernetes-webhook-injector

Go
6
star
59

ghost-image

Ghost Blog Docker image with Aliyun OSS and MySQL
JavaScript
6
star
60

disk-snapshot

Support Aliyun Disk Snapshot in K8S without CSI Plugin
Go
5
star
61

prometheus-operator-charts

5
star
62

alibabacloud-ack-connector

Go
5
star
63

tsung-image

tsung docker image.
Shell
5
star
64

alpine-image

Official Alpine Docker Image with Aliyun Mirror
Shell
5
star
65

notation-alibabacloud-secret-manager

Go
4
star
66

data-on-ack

Examples of Data & AI/ML on Alibaba Cloud ACK by AI Suite
Go
4
star
67

debian-image

Official Debian Docker image with Aliyun mirror
3
star
68

demo-java

Java
3
star
69

node-resource-manager

3
star
70

nginx-sd-image

Python
2
star
71

demo-logstash

2
star
72

secrets-store-csi-driver-provider-alibaba-cloud

The Alibaba Cloud provider for the Secrets Store CSI Driver allows you to fetch secrets from Alibaba Cloud Secrets Manager and mount them into Kubernetes pods.
Go
2
star
73

python-image

Official Python image with Aliyun mirror of Pypi
Shell
2
star
74

grpc-transcoder

An EnvoyFilter generator for grpc-transcoder
Go
2
star
75

mpi-operator

Go
1
star
76

ack-tag-tool

Simple tool to tag all Alibaba Cloud resources used in specific ACK K8s cluster.
Python
1
star
77

wordpress-image

Official Wordpress Docker Image with Aliyun OSS plugin
PHP
1
star
78

haproxy-image

Official Haproxy Docker Image with Aliyun Mirror
1
star
79

cloud-environments

Makefile
1
star
80

argo-workflow-examples

Python
1
star
81

infrakit.aliyun

Infrakit plugins for Aliyun (Alibaba Cloud).
Go
1
star
82

node-image

Official NodeJS Docker Image with Taobao NPM mirror
1
star
83

demo-nodejs

JavaScript
1
star
84

gitops-demo

Smarty
1
star
85

ruby-image

Official Ruby Docker image with Ruby China mirror
Shell
1
star
86

jupyter-notebook

Jupyter Notebook Python, Scala, R, Spark, Mesos Stack
1
star
87

alibabacloud-erdma-controller

Go
1
star