  • Stars
    13,294
  • Rank 2,346 (Top 0.05 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated 6 months ago

Repository Details

🦉 ML Experiments and Data Management with Git

Website • Docs • Blog • Tutorial • Related Technologies • How DVC works • VS Code Extension • Installation • Contributing • Community and Support

Data Version Control, or DVC, is a command-line tool and VS Code extension that helps you develop reproducible machine learning projects:

  1. Version your data and models. Store them in your cloud storage but keep their version info in your Git repo.
  2. Iterate fast with lightweight pipelines. When you make changes, only run the steps impacted by those changes.
  3. Track experiments in your local Git repo (no servers needed).
  4. Compare any data, code, parameters, model, or performance plots.
  5. Share experiments and automatically reproduce anyone's experiment.

Quick start

Please read our Command Reference for a complete list of commands.

A common CLI workflow includes:

Track data
    $ git add train.py params.yaml
    $ dvc add images/

Connect code and data
    $ dvc stage add -n featurize -d images/ -o features/ python featurize.py
    $ dvc stage add -n train -d features/ -d train.py -o model.p -M metrics.json python train.py

Make changes and experiment
    $ dvc exp run -n exp-baseline
    $ vi train.py
    $ dvc exp run -n exp-code-change

Compare and select experiments
    $ dvc exp show
    $ dvc exp apply exp-baseline

Share code
    $ git add .
    $ git commit -m 'The baseline model'
    $ git push

Share data and ML models
    $ dvc remote add myremote -d s3://mybucket/image_cnn
    $ dvc push
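
To make the "Connect code and data" step more concrete, here is a rough Python sketch of the kind of stage entries those two dvc stage add commands record in dvc.yaml. The stage_entry helper is hypothetical and is not part of DVC; the exact file DVC writes may differ between versions, so treat the structure as an approximation.

```python
import json


def stage_entry(cmd, deps=(), outs=(), metrics_no_cache=()):
    """Build a dict resembling one stage of a dvc.yaml file (illustrative helper)."""
    entry = {"cmd": cmd}
    if deps:
        entry["deps"] = list(deps)  # from the -d flags
    if outs:
        entry["outs"] = list(outs)  # from the -o flags
    if metrics_no_cache:
        # Assumption: -M marks a metrics file that is not stored in the cache.
        entry["metrics"] = [{path: {"cache": False}} for path in metrics_no_cache]
    return entry


pipeline = {
    "stages": {
        "featurize": stage_entry(
            "python featurize.py", deps=["images/"], outs=["features/"]
        ),
        "train": stage_entry(
            "python train.py",
            deps=["features/", "train.py"],
            outs=["model.p"],
            metrics_no_cache=["metrics.json"],
        ),
    }
}

if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
```

Because the output of featurize (features/) is a dependency of train, the two entries already describe a small two-step pipeline.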

How DVC works

We encourage you to read our Get Started docs to better understand what DVC does and how it can fit your scenarios.

The closest analogies to describe the main DVC features are these:

  1. Git for data: Store and share data artifacts (like Git-LFS but without a server) and models, connecting them with a Git repository. Data management meets GitOps!
  2. Makefiles for ML: Describes how data or model artifacts are built from other data and code in a standard format. Now you can version your data pipelines with Git.
  3. Local experiment tracking: Turn your machine into an ML experiment management platform, and collaborate with others using existing Git hosting (GitHub, GitLab, etc.).

Git is employed as usual to store and version code (including DVC meta-files as placeholders for data). DVC stores data and model files seamlessly in a cache outside of Git, while preserving almost the same user experience as if they were in the repo. To share and back up the data cache, DVC supports multiple remote storage platforms - any cloud (S3, Azure, Google Cloud, etc.) or on-premise network storage (via SSH, for example).
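
The idea of keeping data in a cache outside of Git, with a small placeholder file in the repo, can be sketched in a few lines of Python. This is a simplified illustration and not DVC's actual implementation: the function names, the .meta.json placeholder format, and the cache directory layout are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into the cache and write a small placeholder next to it."""
    checksum = file_md5(path)
    # Hypothetical layout: <cache>/<first two hex chars>/<remaining chars>
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(path, cached)
    placeholder = path.with_name(path.name + ".meta.json")
    placeholder.write_text(json.dumps({"path": path.name, "md5": checksum}))
    return placeholder


def restore(placeholder: Path, cache_dir: Path) -> Path:
    """Recreate the data file from the cache using only the placeholder."""
    meta = json.loads(placeholder.read_text())
    cached = cache_dir / meta["md5"][:2] / meta["md5"][2:]
    target = placeholder.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target


def demo() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "model.p"
        data.write_bytes(b"weights-v1")
        placeholder = track(data, root / "cache")
        data.unlink()  # simulate a checkout that only has the placeholder
        restored = restore(placeholder, root / "cache")
        return restored.read_bytes() == b"weights-v1"


if __name__ == "__main__":
    print(demo())
```

Only the small placeholder would be committed to Git; the cache directory is what would be synchronized with remote storage.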

DVC pipelines (computational graphs) connect code and data together. They specify every step required to produce a model: the input dependencies (code and data), the commands to run, and the outputs to save.
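
As a minimal sketch of why a pipeline graph lets you rerun only the affected steps, the Python below models the two Quick start stages as a dependency graph and works out which stages a change touches. The STAGES table and both functions are hypothetical and deliberately simplified (a rerun stage is always assumed to change its outputs); this is not how DVC itself is implemented. It requires Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical stage table: name -> (dependencies, outputs)
STAGES = {
    "featurize": (["images/"], ["features/"]),
    "train": (["features/", "train.py"], ["model.p"]),
}


def stage_order():
    """Order stages so that producers run before their consumers."""
    producers = {out: name for name, (_, outs) in STAGES.items() for out in outs}
    graph = {
        name: {producers[dep] for dep in deps if dep in producers}
        for name, (deps, _) in STAGES.items()
    }
    return list(TopologicalSorter(graph).static_order())


def stages_to_run(changed):
    """Return the stages affected by the changed paths, in execution order."""
    dirty = set(changed)
    to_run = []
    for name in stage_order():
        deps, outs = STAGES[name]
        if dirty.intersection(deps):
            to_run.append(name)
            dirty.update(outs)  # simplification: a rerun stage changes its outputs
    return to_run


if __name__ == "__main__":
    print(stages_to_run(["train.py"]))  # -> ['train']
    print(stages_to_run(["images/"]))   # -> ['featurize', 'train']
```

Editing train.py leaves the featurize stage untouched, while changing the images invalidates both stages.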

Last but not least, DVC Experiment Versioning lets you prepare and run a large number of experiments. Their results can be filtered and compared based on hyperparameters and metrics, and visualized with multiple plots.
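
Filtering and comparing experiments by hyperparameters and metrics amounts to selecting and sorting records. The Python sketch below illustrates that idea only; the Experiment class, the select function, and every number in it are made up for the example and do not come from DVC or from real runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    name: str
    params: dict
    metrics: dict


# Hypothetical values, for illustration only.
EXPERIMENTS = [
    Experiment("exp-baseline", {"lr": 0.01, "epochs": 10}, {"accuracy": 0.80}),
    Experiment("exp-code-change", {"lr": 0.01, "epochs": 10}, {"accuracy": 0.85}),
    Experiment("exp-long", {"lr": 0.001, "epochs": 50}, {"accuracy": 0.83}),
]


def select(experiments, where, sort_by, descending=True):
    """Keep experiments matching `where`, ordered by the metric `sort_by`."""
    kept = [exp for exp in experiments if where(exp)]
    return sorted(kept, key=lambda exp: exp.metrics[sort_by], reverse=descending)


if __name__ == "__main__":
    short_runs = select(
        EXPERIMENTS,
        where=lambda exp: exp.params["epochs"] <= 20,
        sort_by="accuracy",
    )
    for exp in short_runs:
        print(exp.name, exp.params, exp.metrics)
```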

VS Code Extension

To use DVC as a GUI right from your VS Code IDE, install the DVC Extension from the Marketplace. It currently features experiment tracking and data management, and more features (data pipeline support, etc.) are coming soon!

Note: You'll have to install core DVC on your system separately (as detailed below). The Extension will guide you if needed.

Installation

There are several ways to install DVC: in VS Code; using snap, choco, brew, conda, pip; or with an OS-specific package. Full instructions are available here.

Snapcraft (Linux)

snap install dvc --classic

This corresponds to the latest tagged release. Add --beta for the latest tagged release candidate, or --edge for the latest main version.

Chocolatey (Windows)

choco install dvc

Brew (macOS)

brew install dvc

Anaconda (Any platform)

conda install -c conda-forge mamba # installs much faster than conda
mamba install -c conda-forge dvc

Depending on the remote storage type you plan to use to keep and share your data, you might need to install optional dependencies: dvc-s3, dvc-azure, dvc-gdrive, dvc-gs, dvc-oss, dvc-ssh.

PyPI (Python)

pip install dvc

Depending on the remote storage type you plan to use to keep and share your data, you might need to specify one of the optional dependencies: s3, gs, azure, oss, ssh, or all to include every one of them. The command should look like this: pip install 'dvc[s3]' (in this case, AWS S3 dependencies such as boto3 will be installed automatically).

To install the development version, run:

pip install git+https://github.com/iterative/dvc

Package (Platform-specific)

Self-contained packages for Linux, Windows, and Mac are available. The latest version of the packages can be found on the GitHub releases page.

Ubuntu / Debian (deb)

sudo wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list
wget -qO - https://dvc.org/deb/iterative.asc | sudo apt-key add -
sudo apt update
sudo apt install dvc

Fedora / CentOS (rpm)

sudo wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo
sudo rpm --import https://dvc.org/rpm/iterative.asc
sudo yum update
sudo yum install dvc

Contributing

Contributions are welcome! Please see our Contributing Guide for more details. Thanks to all our contributors!

Community and Support

Copyright

This project is distributed under the Apache license version 2.0 (see the LICENSE file in the project root).

By submitting a pull request to this project, you agree to license your contribution under the Apache license version 2.0 to this project.

Citation

Iterative, DVC: Data Version Control - Git for Data & Models (2020) DOI:10.5281/zenodo.012345.

Barrak, A., Eghan, E.E. and Adams, B. On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects, in Proceedings of the 28th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2021. Hawaii, USA.

More Repositories

1

cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
JavaScript
3,996
star
2

datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
Python
757
star
3

mlem

🐢 A tool to package, serve, and deploy any ML model on any platform. Archived to be resurrected one day🀞
Python
717
star
4

PyDrive2

Google Drive API Python wrapper library. Maintained fork of PyDrive.
Python
565
star
5

shtab

↔️ Automagic shell tab completion for Python CLI applications
Python
362
star
6

dvc.org

📖 DVC website and documentation
TypeScript
320
star
7

terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
Go
288
star
8

vscode-dvc

Machine learning experiment tracking and data versioning with DVC extension for VS Code
TypeScript
187
star
9

example-get-started

Get started DVC project
Python
167
star
10

dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
Python
161
star
11

gto

🏷️ Git Tag Ops. Turn your Git repository into Artifact Registry or Model Registry.
Python
138
star
12

awesome-iterative-projects

A list of projects relying on Iterative.AI tools to achieve awesomeness
64
star
13

dataset-registry

Dataset registry DVC project
60
star
14

magnetic-tiles-defect

Demo Computer Vision Project
Jupyter Notebook
59
star
15

example-get-started-experiments

Get started DVC project
Python
45
star
16

course-ds-base

Jupyter Notebook
44
star
17

demo-bank-customer-churn

Demo DVC project training a classification model on tabular data
Jupyter Notebook
38
star
18

aita_dataset

AITA dataset based on r/AmItheAsshole/
Python
33
star
19

setup-dvc

DVC GitHub action
JavaScript
30
star
20

ldb-resources

Python
27
star
21

mlem.ai

✨ Landing page for MLEM
TypeScript
27
star
22

example_cml

Python
27
star
23

workshop-uncool-mlops

Accompanies the uncool MLOps workshop
Python
26
star
24

setup-cml

GitHub Action for CML setup
TypeScript
24
star
25

scmrepo

SCM wrapper and fsspec filesystem for Git for use in DVC.
Python
21
star
26

cml_base_case

Python
21
star
27

example-repos-dev

Source code and generator scripts for example DVC projects
Python
21
star
28

cml_cloud_case

Python
20
star
29

dvc-bench

Benchmarks for DVC
Shell
20
star
30

cml_dvc_case

Python
18
star
31

dvc-data

DVC's data management subsystem
Python
18
star
32

intellij-dvc

DVC integration plugin for Intellij IDEs including PyCharm, IntelliJ IDEA and CLion
Java
17
star
33

studio-support

❓ DVC Studio Issues, Question, and Discussions
16
star
34

pytest-servers

Create temporary directories on the various filesystems for testing
Python
15
star
35

studio-selfhosted

This repository contains auxiliary installation code for self-hosting Studio
Shell
14
star
36

py-template

Hypermodern Python Cookiecutter
Python
14
star
37

VSCode-DVC-Workshop

Workshop about DVC VSCode Extension
Jupyter Notebook
14
star
38

example-dvc-experiments

DVC Get Started Project with a focus on `dvc experiment` features.
HTML
13
star
39

cml.dev

🔗 CML website and documentation
TypeScript
12
star
40

dvcyaml-schema

Schema for dvc.yaml
Python
10
star
41

dvc-objects

dvc objects - contains filesystem and object-db level abstractions to use in dvc and dvc-data
Python
10
star
42

example-versioning

Data sets and ML models versioning example from DVC get started
Python
9
star
43

dvc-streamlit-components

Streamlit components for DVC
Python
9
star
44

morefs

A collection of self-contained fsspec-based filesystems
Python
9
star
45

priority-list

⛏️ Make a dent in GitHub issue & PR backlogs across repositories
Python
8
star
46

cml_tensorboard_case

Python
8
star
47

dvc-s3

AWS S3 plugin for dvc
Python
8
star
48

llm-demo

Demo of using DVC with LangChain
Python
8
star
49

example-mlem-get-started

Get Started MLEM project
Python
7
star
50

dvc-task

Celery task queue used in DVC
Python
7
star
51

tpi

Python wrapper for terraform-provider-iterative
Python
7
star
52

example-gto

Get Started GTO Project
7
star
53

example-pokemon-classifier

Example project with a CNN to train a Pokémon type classifier.
Python
7
star
54

dvc-render

Library for rendering DVC plots
Python
6
star
55

dvc-studio-client

Client to interact with DVC Studio
Python
6
star
56

pytest-test-utils

Python
6
star
57

workshop-uncool-mlops-solution

Python
6
star
58

gatsby-theme-iterative

A Gatsby theme for shared logic between all the websites from iterative.ai
JavaScript
6
star
59

stale-model-example

This is the repo for the Preventing Stale Models in Production blog post.
Jupyter Notebook
6
star
60

gto-action

⚙️ GTO GitHub Action
Shell
6
star
61

evidently-dvc

Tutorial: Automate Data Validation and Model Monitoring Pipelines with DVC and Evidently
HTML
6
star
62

features

A collection of development container 'features' for machine learning and data science
Shell
6
star
63

course-checkpoints-project

This is the project we use for the DVC educational course to demonstrate how checkpoints work.
HTML
6
star
64

sqltrie

SQL-based prefix tree implementation inspired by pygtrie and python-diskcache
Python
5
star
65

dvc-checkpoints-mnist

Example of checkpoints in a DVC project training a simple convolutional neural net to classify MNIST data
Python
5
star
66

dvc-s3-repo

Maintain deb and rpm repositories on s3
Python
5
star
67

enhancement-proposals

5
star
68

dvc-gs

Google Storage plugin for dvc
Python
4
star
69

link-check

A Node-based tool to verify if links are alive. Built to be used anywhere!
TypeScript
4
star
70

dvc-snap

dvc snap package
Shell
4
star
71

sagemaker-pipeline

An example project, showcasing a DVC pipeline using SageMaker SDK for data preparation and model training
Python
4
star
72

cml-runner-base-case

Python
4
star
73

cml-playground

Shell
4
star
74

pretrained-model-demo

Python
4
star
75

homebrew-dvc

Automatic updates for dvc homebrew package
Shell
4
star
76

vscode-dvc-demo

Python
3
star
77

example_model_export_cml

Example on how to use CML to provision an AWS EC2 runner, train a model, and export the resulting model.
Python
3
star
78

dvc-azure

Azure plugin for dvc
Python
3
star
79

cnn_tutorial

CNN tutorial for DVC
Python
3
star
80

telemetry-python

Common library to send usage telemetry
Python
3
star
81

dvc-learn-project

This is the project used in the DVC Learn Meetups and videos.
HTML
3
star
82

checkpoints-tutorial

This is the code used in the checkpoints tutorial.
Python
3
star
83

ldb

Python
3
star
84

chocolatey-dvc

Chocolatey package for dvc
PowerShell
3
star
85

blog-tpi-jupyter

Terraform Provider Iterative + Jupyter + TensorBoard + AWS/Azure/GCP/K8s
Jupyter Notebook
3
star
86

dvc-gdrive

Google Drive plugin for DVC
Python
2
star
87

link-check.action

A GitHub Action driver for link-check, deployed via submodules.
JavaScript
2
star
88

katacoda-scenarios

Interactive Katacoda Scenarios
Shell
2
star
89

dvc-exe

Private repository for building and signing dvc for windows
Inno Setup
2
star
90

jameson-metrics

Metrics examples
Python
2
star
91

example-mlem

Example of using MLEM with DVC Pipeline
Python
2
star
92

example-get-started-s3

Example get started (metrics and plots in S3)
Python
2
star
93

dvc_action_example

Python
2
star
94

dvc-oss

Alibaba OSS plugin for dvc
Python
2
star
95

dvc-test

Integration tests for dvc
Python
2
star
96

reflink-copy

Python wrapper for `reflink_copy` Rust library
Python
2
star
97

vscode-dvc-pack

2
star
98

dvclive-exp-tracking

Example repo to show how to start tracking experiments in DVC by adding DVCLive to your Python code.
Jupyter Notebook
2
star
99

cookiecutter-dvc-plugin

A Cookiecutter template for dvc plugins
Python
2
star
100

testing-ldb

Aug 10th Hackathon
Python
2
star