• Stars: 151
• Rank: 246,057 (Top 5%)
• Language: Jupyter Notebook
• License: Apache License 2.0
• Created over 6 years ago
• Updated 3 months ago

Repository Details

🌻 Flow with FlorDB

FlorDB: Nimble Experiment Management for Iterative ML

Flor (for "fast low-overhead recovery") is a record-replay system for deep learning and other forms of machine learning that train models on GPUs. Flor was developed to speed up hindsight logging: a cyclic-debugging practice that involves adding logging statements after encountering a surprise, and efficiently re-training with more logging. Flor takes low-overhead checkpoints during training (the record phase) and uses those checkpoints for replay speedups based on memoization and parallelism.

FlorDB integrates Flor, git and sqlite3 to manage model developers' logs, execution data, versions of code, and training checkpoints. In addition to serving as an experiment management solution for ML engineers, FlorDB extends hindsight logging across model training versions for the retroactive evaluation of iterative ML.

Flor and FlorDB are software developed at UC Berkeley's RISE Lab.

Installation

pip install flordb

Demo

Napa Retreat Demo

First run

Run the examples/rnn.py script to test your installation. This script will train a small linear model on MNIST. FLOR shadow branches permit us to commit your work automatically on every run, without interfering with your other commits. You can later review and merge the flor shadow branch as you would any other git branch.

git checkout -b flor.shadow
python examples/rnn.py --flor readme

When finished, you will have committed to the shadow branch and written execution metadata into a .flor directory in your current directory. Additionally, Flor will have created a directory tree in your home directory to organize your experiments. You can find your experiment as follows:

ls ~/.flor/

Confirm that Flor saved checkpoints of the examples/rnn.py execution in your home directory. Flor will access and interpret the contents of ~/.flor automatically. You should routinely clear this stash or spool it to the cloud to free up disk space.
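To decide when the stash needs clearing, it can help to check how much space it occupies. The following is a minimal sketch using only the Python standard library; the helper name dir_size_bytes is hypothetical and is not part of Flor.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    stash = Path.home() / ".flor"  # the checkpoint directory named above
    print(f"{stash}: {dir_size_bytes(stash) / 1e6:.1f} MB")
```

The function returns 0 when the directory does not exist, so it is safe to run before the first experiment.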

View your experiment history

From the same directory where you ran the example above, open an IPython terminal, then load and pivot the log records.

In [1]: from flor import full_pivot, log_records
In [2]: full_pivot(log_records())
Out[2]: 
     runid              tstamp  epoch  step device learning_rate                 loss
0   readme 2023-03-12 12:23:53      1   100    cpu          0.01   0.5304957032203674
1   readme 2023-03-12 12:23:53      1   200    cpu          0.01  0.21829535067081451
2   readme 2023-03-12 12:23:53      1   300    cpu          0.01  0.15856705605983734
3   readme 2023-03-12 12:23:53      1   400    cpu          0.01  0.11441942304372787
4   readme 2023-03-12 12:23:53      1   500    cpu          0.01  0.06835074722766876
5   readme 2023-03-12 12:23:53      1   600    cpu          0.01  0.13750575482845306
6   readme 2023-03-12 12:23:53      2   100    cpu          0.01  0.11708579957485199
7   readme 2023-03-12 12:23:53      2   200    cpu          0.01  0.08852845430374146
8   readme 2023-03-12 12:23:53      2   300    cpu          0.01  0.16527307033538818
9   readme 2023-03-12 12:23:53      2   400    cpu          0.01  0.11036019027233124
10  readme 2023-03-12 12:23:53      2   500    cpu          0.01  0.05740281194448471
11  readme 2023-03-12 12:23:53      2   600    cpu          0.01  0.07785198092460632
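Pivoting, in general, turns records that hold one logged value each into rows that hold one column per logged name, as in the table above. The sketch below illustrates that idea with the standard library only. The long-format record layout (runid, epoch, step, name, value) and the helper pivot_records are assumptions made for illustration; they are not Flor's actual schema or implementation.

```python
def pivot_records(records):
    """Pivot long-format (runid, epoch, step, name, value) records into wide rows."""
    rows = {}
    for runid, epoch, step, name, value in records:
        key = (runid, epoch, step)
        # One output row per (runid, epoch, step); each logged name becomes a column.
        row = rows.setdefault(key, {"runid": runid, "epoch": epoch, "step": step})
        row[name] = value
    return list(rows.values())


# Values taken from the first two rows of the table above.
records = [
    ("readme", 1, 100, "learning_rate", 0.01),
    ("readme", 1, 100, "loss", 0.5304957032203674),
    ("readme", 1, 200, "learning_rate", 0.01),
    ("readme", 1, 200, "loss", 0.21829535067081451),
]
wide = pivot_records(records)
for row in wide:
    print(row)
```

Four long-format records collapse into two wide rows, one per logged step.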

Model Training Kit (MTK)

The MTK includes utilities for serializing and checkpointing PyTorch state, and utilities for resuming, auto-parallelizing, and memoizing executions from checkpoint. The model developer passes objects for checkpointing to flor, and gives it control over loop iterators by calling MTK.checkpoints and MTK.loop as follows:

from flor import MTK as Flor

import torch

# Placeholders: define these objects in your own training script.
trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

# Register the objects Flor should checkpoint.
Flor.checkpoints(net, optimizer)
for epoch in Flor.loop(range(...)):        # main (epoch) loop
    for data in Flor.loop(trainloader):    # nested training loop
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    eval(net, testloader)  # user-defined evaluation function, not the builtin eval

As shown, we pass the neural network and optimizer to Flor for checkpointing with Flor.checkpoints(net, optimizer). We wrap both the nested training loop and main loop with Flor.loop. This lets Flor jump to an arbitrary epoch using checkpointed state, and skip the nested training loop when intermediate state isn't probed.
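The idea of jumping to an epoch and skipping the nested loop can be illustrated with a toy record/replay sketch. This is not Flor's implementation: the state dictionary stands in for the model and optimizer, and the functions train_epoch, record, and replay_epoch are hypothetical names used only for this example.

```python
import copy


def train_epoch(state, steps=3):
    """Stand-in for the nested training loop: mutate state once per step."""
    for _ in range(steps):
        state["weight"] += 1
        state["steps_run"] += 1


def record(num_epochs):
    """Record phase: run every epoch and checkpoint the state at each epoch's end."""
    state = {"weight": 0, "steps_run": 0}
    checkpoints = {}
    for epoch in range(num_epochs):
        train_epoch(state)
        checkpoints[epoch] = copy.deepcopy(state)
    return checkpoints


def replay_epoch(checkpoints, epoch, probe_inner):
    """Replay phase: reach the end of `epoch` without re-running earlier epochs."""
    if not probe_inner:
        # Nothing inside the training loop is logged: skip it and load its result.
        return copy.deepcopy(checkpoints[epoch])
    # Intermediate state is probed: restore the state that preceded this epoch,
    # then re-execute only this epoch's training loop.
    if epoch == 0:
        state = {"weight": 0, "steps_run": 0}
    else:
        state = copy.deepcopy(checkpoints[epoch - 1])
    train_epoch(state)
    return state


checkpoints = record(num_epochs=5)
replayed = replay_epoch(checkpoints, epoch=3, probe_inner=True)
skipped = replay_epoch(checkpoints, epoch=3, probe_inner=False)
print(replayed == checkpoints[3], skipped == checkpoints[3])
```

Both replay paths arrive at the state recorded at the end of the chosen epoch. The first re-runs one epoch of work; the second re-runs none.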

Hindsight Logging

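from flor import MTK as Flor
import torch

# Placeholders: define these objects in your own training script.
trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in Flor.loop(range(...)):
    for batch in Flor.loop(trainloader):
        ...
    eval(net, testloader)  # user-defined evaluation function, not the builtin eval
    log_confusion_matrix(net, testloader)  # logging statement added in hindsight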

Suppose you want to view a confusion matrix as it changes throughout training. Add the code that generates the confusion matrix, as abbreviated in the log_confusion_matrix call above.
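The body of log_confusion_matrix is not shown in this README. As a rough illustration of the counting step it would need, here is a minimal sketch over plain lists of integer class labels. The helper confusion_matrix and the toy data are hypothetical; in a real script the labels and predictions would come from running net over testloader.

```python
def confusion_matrix(labels, predictions, num_classes):
    """Count outcomes; rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, pred in zip(labels, predictions):
        matrix[true][pred] += 1
    return matrix


# Toy data for illustration only.
labels = [0, 1, 2, 2, 1, 0]
predictions = [0, 2, 2, 2, 1, 1]
matrix = confusion_matrix(labels, predictions, num_classes=3)
for row in matrix:
    print(row)
```

A function like this would run once per epoch, at the point where log_confusion_matrix appears in the loop above.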

python3 mytrain.py --replay_flor PID/NGPUS [your_flags]

As before, you tell FLOR to run in replay mode by setting --replay_flor. You also tell FLOR how many GPUs from the pool to use for parallelism, and you dispatch this script simultaneously once per GPU, varying pid:<int> so that the runs together span all the GPUs. For example, to run segment 3 out of 5 segments, you would write: --replay_flor 3/5.

If, instead of replaying all of training, you wish to re-execute only a fraction of the epochs, you can do so by choosing the values of pid and ngpus accordingly. Suppose you want to run the tenth epoch of a training job that ran for 200 epochs. You would set pid:9 and ngpus:200.
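One way to picture how PID/NGPUS could select a slice of the epochs is sketched below. It assumes a zero-based pid (consistent with pid:9 selecting the tenth epoch) and an even, contiguous split of the epochs; how Flor actually assigns epochs to segments is not stated here. The helpers segment_epochs and replay_flag are hypothetical.

```python
def segment_epochs(pid, ngpus, total_epochs):
    """Return the epoch indices covered by segment `pid` out of `ngpus` segments."""
    if not 0 <= pid < ngpus:
        raise ValueError("pid must satisfy 0 <= pid < ngpus")
    # Integer arithmetic yields contiguous, non-overlapping segments
    # that together cover every epoch exactly once.
    start = pid * total_epochs // ngpus
    stop = (pid + 1) * total_epochs // ngpus
    return range(start, stop)


def replay_flag(pid, ngpus):
    """Format the command-line flag in the PID/NGPUS form shown above."""
    return f"--replay_flor {pid}/{ngpus}"


# The tenth epoch (index 9) of a 200-epoch job, as in the example above.
print(replay_flag(9, 200), list(segment_epochs(9, 200, 200)))
```

Under these assumptions, setting ngpus equal to the number of epochs makes every segment a single epoch, which is why pid:9 with ngpus:200 isolates the tenth epoch.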

Publications

To cite this work, please refer to the Hindsight Logging paper (VLDB '21).

FLOR is open source software developed at UC Berkeley. Joe Hellerstein (databases), Joey Gonzalez (machine learning), and Koushik Sen (programming languages) are the primary faculty members leading this work.

This work is released as part of Rolando Garcia's doctoral dissertation at UC Berkeley, and has been the subject of study by Eric Liu and Anusha Dandamudi, both of whom completed their master's theses on FLOR. Our list of publications is reproduced below. Finally, we thank Vikram Sreekanti, Dan Crankshaw, and Neeraja Yadwadkar for guidance, comments, and advice. Bobby Yan was instrumental in the development of FLOR and its corresponding experimental evaluation.

License

FLOR is licensed under the Apache v2 License.
