  • Language
    Python
  • License
    GNU General Publi...
  • Created over 7 years ago
  • Updated almost 7 years ago

gpu_mon

A Python script that monitors GPU access and manages external programs while the GPU is idle

Blog post on medium.com

How

Every N seconds it checks the /dev/nvidiaX device files for other processes accessing them (using the fuser tool). Based on this, it treats each GPU card as idle or busy. When a GPU becomes idle, gpu_mon starts an external program, which is stopped as soon as another process accesses the device file.
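
The check described above can be sketched roughly as follows. This is a minimal illustration, not gpu_mon's actual code: the function names are hypothetical, process names are read from /proc/PID/comm (an assumption about how names could be resolved), and the ignore list mirrors the ignore_programs option from the configuration.

```python
import re
import subprocess
from typing import List, Set


def parse_fuser_pids(stdout: str) -> List[int]:
    """Extract PIDs from fuser's standard output (whitespace-separated numbers)."""
    return [int(tok) for tok in re.findall(r"\d+", stdout)]


def pids_using(device: str) -> List[int]:
    """Return the PIDs that have `device` open, as reported by fuser."""
    # fuser writes PIDs to stdout and the file name to stderr. It exits
    # non-zero when nothing accesses the file, so the return code is not
    # treated as an error here.
    result = subprocess.run(["fuser", device], capture_output=True, text=True)
    return parse_fuser_pids(result.stdout)


def process_name(pid: int) -> str:
    """Read the command name from /proc (Linux only); empty if the process is gone."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return ""


def is_idle(names: List[str], ignore_programs: Set[str]) -> bool:
    """A GPU counts as idle when every process using it is on the ignore list."""
    return all(name in ignore_programs for name in names)
```

A polling loop would then call pids_using("/dev/nvidia0") every N seconds, map the PIDs through process_name, and start or stop the external program depending on what is_idle returns.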

Additionally, it can monitor PTY sessions to check whether active users are logged in, and stop its processes to avoid interfering with somebody's work.
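
One possible way to detect active PTY users is sketched below. It assumes that terminal idle time can be estimated from the last access time of the /dev/pts/N device, which is how tools such as w report it. Whether gpu_mon does the same is not stated here, and the function names are hypothetical. The whitelist and idle_seconds parameters correspond to the options in the [tty] configuration section.

```python
import os
import pwd
import time
from typing import Iterable, List, Set, Tuple


def list_pty_sessions(pts_dir: str = "/dev/pts") -> List[Tuple[str, float]]:
    """Return (owner, idle_seconds) for every pseudo-terminal under pts_dir."""
    sessions = []
    now = time.time()
    for name in os.listdir(pts_dir):
        if not name.isdigit():  # skip entries such as ptmx
            continue
        st = os.stat(os.path.join(pts_dir, name))
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name
        except KeyError:
            owner = str(st.st_uid)  # uid without a passwd entry
        sessions.append((owner, now - st.st_atime))
    return sessions


def active_users(sessions: Iterable[Tuple[str, float]],
                 whitelist: Set[str],
                 idle_seconds: float) -> Set[str]:
    """Users who are not whitelisted and were active within idle_seconds."""
    return {user for user, idle in sessions
            if user not in whitelist and idle < idle_seconds}
```

If active_users(list_pty_sessions(), {"user1", "user2"}, 300) returns a non-empty set, the managed processes would be stopped.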

Why

  1. Monitoring and reporting of GPU usage
  2. Mining on idle hardware

Requirements and limitations

Written in Python 3. Developed and tested under Linux (Ubuntu and Debian) with NVIDIA cards. It has no external library requirements beyond the standard library.

To see the processes of all users, either the script needs to run as root or the fuser tool needs to have the suid bit set.
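
A quick way to verify this requirement before starting the monitor is sketched below. The helper names are hypothetical, and the check only looks at the suid bit itself, not at who owns the fuser binary.

```python
import os
import shutil
import stat
from typing import Optional


def has_suid_bit(path: str) -> bool:
    """True if the file at `path` has the set-user-ID bit set."""
    return bool(os.stat(path).st_mode & stat.S_ISUID)


def can_see_all_processes(fuser_path: Optional[str] = None) -> bool:
    """True when running as root, or when the fuser binary is suid."""
    if os.geteuid() == 0:
        return True
    path = fuser_path or shutil.which("fuser")
    return path is not None and has_suid_bit(path)
```

If can_see_all_processes() returns False, only the current user's processes are likely to be reported, so a GPU used by somebody else may look idle.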

Running

  1. Create a configuration file at ~/.config/gpu_mon.conf from the template provided in the conf dir
  2. Run ./gpu_mon.py

Configuration

A basic configuration file looks like this and is mostly self-explanatory:

[defaults]
; how frequently to perform GPU and tty checks
interval_seconds=10

; configuration of GPUs to monitor for external program access. There can be several such sections with the 'gpu-' prefix
[gpu-all]
; list of comma-separated gpu indices or ALL to handle all available gpus
gpus=ALL
; comma-separated list of programs which can access gpu and should be ignored
ignore_programs=nvidia-smi

; program which will be started on gpu during idle time
[process-all]
dir=/tmp
cmd=miner-run
; list of gpu indices or ALL to handle all available gpus
gpus=ALL
; log for processes. If not specified, output goes to the console. If a file is specified, data is appended to it
log=/dev/null

; configuration of tty monitoring. If enabled and a user is active and not in the whitelist, all processes will be stopped
[tty]
enabled=True
whitelist=user1,user2
; how long user should be inactive in tty to be ignored by checker
idle_seconds=300

The configuration above checks every 10 seconds whether any /dev/nvidiaN device file is open and, if nobody is using them (the nvidia-smi command is ignored), starts the command miner-run, which should occupy all GPUs. If some program (such as TensorFlow or PyTorch) opens any of the GPUs, the miner is terminated.
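
For illustration, a file in this format can be read with the standard library's configparser, which accepts ';' comment lines by default. The sketch below is not taken from gpu_mon itself: the helper names, the shape of the returned dictionary, and the fallback values are all assumptions.

```python
import configparser
from typing import List, Optional

SAMPLE = """
[defaults]
; how frequently to perform GPU and tty checks
interval_seconds=10

[gpu-all]
gpus=ALL
ignore_programs=nvidia-smi

[process-all]
dir=/tmp
cmd=miner-run
gpus=ALL
log=/dev/null

[tty]
enabled=True
whitelist=user1,user2
idle_seconds=300
"""


def split_list(value: str) -> List[str]:
    """Split a comma-separated option into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_gpus(value: str) -> Optional[List[int]]:
    """Return None for ALL, otherwise the list of GPU indices."""
    if value.strip().upper() == "ALL":
        return None
    return [int(item) for item in split_list(value)]


def load_config(text: str) -> dict:
    parser = configparser.ConfigParser()
    parser.read_string(text)
    gpu_sections = {
        name: {
            "gpus": parse_gpus(parser.get(name, "gpus")),
            "ignore_programs": split_list(
                parser.get(name, "ignore_programs", fallback="")),
        }
        for name in parser.sections() if name.startswith("gpu-")
    }
    processes = {
        name: {
            "dir": parser.get(name, "dir"),
            "cmd": parser.get(name, "cmd"),
            "gpus": parse_gpus(parser.get(name, "gpus")),
            "log": parser.get(name, "log", fallback=None),
        }
        for name in parser.sections() if name.startswith("process-")
    }
    return {
        "interval_seconds": parser.getint("defaults", "interval_seconds"),
        "gpu_sections": gpu_sections,
        "processes": processes,
        # Fallbacks below are assumptions for a missing [tty] section.
        "tty_enabled": parser.getboolean("tty", "enabled", fallback=False),
        "tty_whitelist": split_list(parser.get("tty", "whitelist", fallback="")),
        "tty_idle_seconds": parser.getint("tty", "idle_seconds", fallback=300),
    }
```

To load the real file, parser.read(os.path.expanduser("~/.config/gpu_mon.conf")) could be used in place of read_string.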

It's possible to fine-tune access to individual GPUs, which allows miners to be preempted on individual GPUs. In this example, we define a per-GPU miner process, each started with the proper CUDA_VISIBLE_DEVICES variable set:

[defaults]
; how frequently to perform GPU and tty checks
interval_seconds=10

; configuration of GPUs to monitor for external program access. There can be several such sections with the 'gpu-' prefix
[gpu-all]
; list of comma-separated gpu indices or ALL to handle all available gpus
gpus=ALL
; comma-separated list of programs which can access gpu and should be ignored
ignore_programs=nvidia-smi

; program which will be started on gpu during idle time
[process-0]
dir=/tmp
cmd=miner-run
; list of gpu indices or ALL to handle all available gpus
gpus=0
; log for processes. If not specified, output goes to the console. If a file is specified, data is appended to it
log=/dev/null

; program which will be started on gpu during idle time
[process-1]
dir=/tmp
cmd=miner-run
; list of gpu indices or ALL to handle all available gpus
gpus=1
; log for processes. If not specified, output goes to the console. If a file is specified, data is appended to it
log=/dev/null

; configuration of tty monitoring. If enabled and a user is active and not in the whitelist, all processes will be stopped
[tty]
enabled=True
whitelist=user1,user2
; how long user should be inactive in tty to be ignored by checker
idle_seconds=300
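
As a rough sketch of how a per-GPU process section could be turned into a running process, the code below restricts the child to its GPUs through CUDA's CUDA_VISIBLE_DEVICES environment variable. The function names are hypothetical, and using None to stand for ALL is an assumption of this sketch rather than gpu_mon's actual behaviour.

```python
import os
import shlex
import subprocess
from typing import Dict, List, Optional


def build_env(gpus: Optional[List[int]],
              base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and limit CUDA to the given GPU indices."""
    env = dict(os.environ if base is None else base)
    if gpus is not None:  # None stands for ALL: leave the variable untouched
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env


def start_process(cmd: str, cwd: str, gpus: Optional[List[int]],
                  log: Optional[str] = None) -> subprocess.Popen:
    """Start `cmd` in `cwd`; append its output to `log` if one is given."""
    out = open(log, "a") if log else None
    try:
        return subprocess.Popen(
            shlex.split(cmd),
            cwd=cwd,
            env=build_env(gpus),
            stdout=out,
            stderr=subprocess.STDOUT if out else None,
        )
    finally:
        if out is not None:
            out.close()  # the child keeps its own copy of the descriptor
```

For the [process-1] section above this would be start_process("miner-run", "/tmp", [1], "/dev/null"). Calling terminate() on the returned object stops the miner once the GPU is needed again.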

Donations

If you find this useful and want to support the developer, please consider making a donation.

  • BTC: 1FvhCby4UNtHmm2DFzzFRvfDL64uLSt4CN
  • BCC: 18cpNK3LmH7mbYkvUyDo6TSji4zGraCsu
  • ZEC: t1WMErz3JZZwkK1NLVadoi9ydgFHZhPHrWo
  • KMD: R9uk6UARL1vbyoGuP8NQNf8JvTxyRA1Xt1
  • Paypal: https://www.paypal.me/shmuma
