• Stars
    star
    823
  • Rank 55,417 (Top 2 %)
  • Language
    Go
  • License
    Other
  • Created about 5 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

GPU Manager

Build Status

GPU Manager is used for managing the nvidia GPU devices in Kubernetes cluster. It implements the DevicePlugin interface of Kubernetes. So it's compatible with 1.9+ of Kubernetes release version.

To compare with the combination solution of nvidia-docker and nvidia-k8s-plugin, GPU manager will use native runc without modification but nvidia solution does. Besides we also support metrics report without deploying new components.

To schedule a GPU payload correctly, GPU manager should work with gpu-admission which is a kubernetes scheduler plugin.

GPU manager also supports the payload with fraction resource of GPU device such as 0.1 card or 100MiB gpu device memory. If you want this kind feature, please refer to vcuda-controller project.

Build

1. Build binary

  • Prerequisite
    • CUDA toolkit
make

2. Build image

  • Prerequisite
    • Docker
make img

Prebuilt image

Prebuilt image can be found at thomassong/gpu-manager

Deploy

GPU Manager is running as daemonset, and because of the RABC restriction and hydrid cluster, you need to do the following steps to make this daemonset run correctly.

  • service account and clusterrole
kubectl create sa gpu-manager -n kube-system
kubectl create clusterrolebinding gpu-manager-role --clusterrole=cluster-admin --serviceaccount=kube-system:gpu-manager
  • label node with nvidia-device-enable=enable
kubectl label node <node> nvidia-device-enable=enable
  • submit daemonset yaml
kubectl create -f gpu-manager.yaml

Pod template example

There is nothing special to submit a Pod except the description of GPU resource is no longer 1 . The GPU resources are described as that 100 tencent.com/vcuda-core for 1 GPU and N tencent.com/vcuda-memory for GPU memory (1 tencent.com/vcuda-memory means 256Mi GPU memory). And because of the limitation of extend resource validation of Kubernetes, to support GPU utilization limitation, you should add tencent.com/vcuda-core-limit: XX in the annotation field of a Pod.

Notice: the value of tencent.com/vcuda-core is either the multiple of 100 or any value smaller than 100.For example, 100, 200 or 20 is valid value but 150 or 250 is invalid

  • Submit a Pod with 0.3 GPU utilization and 7680MiB GPU memory with 0.5 GPU utilization limit
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotations:
    tencent.com/vcuda-core-limit: 50
spec:
  restartPolicy: Never
  containers:
  - image: <test-image>
    name: nvidia
    command:
    - /usr/local/nvidia/bin/nvidia-smi
    - pmon
    - -d
    - 10
    resources:
      requests:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
  • Submit a Pod with 2 GPU card
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
spec:
  restartPolicy: Never
  containers:
  - image: <test-image>
    name: nvidia
    command:
    - /usr/local/nvidia/bin/nvidia-smi
    - pmon
    - -d
    - 10
    resources:
      requests:
        tencent.com/vcuda-core: 200
        tencent.com/vcuda-memory: 60
      limits:
        tencent.com/vcuda-core: 200
        tencent.com/vcuda-memory: 60

FAQ

If you have some questions about this project, you can first refer to FAQ to find a solution.

More Repositories

1

tke

Native Kubernetes container management platform supporting multi-tenant and multi-cluster
Go
1,474
star
2

kvass

Kvass is a Prometheus horizontal auto-scaling solution , which uses Sidecar to generate special config file only containes part of targets assigned from Coordinator for every Prometheus shard.
Go
614
star
3

kstone

Kstone is an etcd management platform, providing cluster management, monitoring, backup, inspection, data migration, visual viewing of etcd data, and intelligent diagnosis.
Go
526
star
4

vcuda-controller

C
494
star
5

gpu-admission

Go
128
star
6

tapp

Another great app kind for Kubernetes!
Go
113
star
7

galaxy

Providing high-performance network for Kubernetes
Go
109
star
8

image-transfer

Go
74
star
9

cron-hpa

Cron Horizontal Pod Autoscaler(CronHPA)
Go
60
star
10

lb-controlling-framework

A controller that helps you manipulate arbitrary load balancers
Go
56
star
11

charts

Mustache
40
star
12

docs

https://tkestack.github.io/docs/
37
star
13

go-nvml

Go
31
star
14

kube-jarvis

Go
28
star
15

tke-k8s-distro

Repository of Tencent TKE Kubernetes Distribution
Shell
28
star
16

yarn-opterator

An kubernetes opterator to manager the yarn nodemanager, which can be used to deploy yarn on kubernetes.
Go
22
star
17

volume-decorator

Go
14
star
18

csi-operator

Go
14
star
19

kstone-dashboard

Kstone Dashboard is a web-based UI for managing Kstone. It can help users to manage etcd based kstone and enjoy monitoring, inspection, visualization, backup and other features.
TypeScript
14
star
20

kstone-etcd-operator

kstone-etcd-operator is a subproject of etcd cluster management platform kstone. It's inspired by etcd-operator. And has more complete support for persistent storage and better disaster tolerance.
Go
11
star
21

toc

7
star
22

knitnet-operator

Operator for support direct networking between Pods and Services in different Kubernetes clusters
Go
7
star
23

gpu-exporter

C
5
star
24

ronqir

CRI runtime in Rust
5
star
25

gc-controller

Go
4
star
26

kube-event-watcher

Go
4
star
27

aia-ip-controller

Aia-ip-controller is a Kubernetes application, which make it possible for Tencent Kubernetes Engine(TKE) customers to leverage Tencent Cloud's AIa Service to accelerate network with global coverage.
Go
4
star
28

tke-extend-network-controller

A Network Controller for TKE
Go
3
star
29

web

TKEStack website
HTML
1
star