Data Science Process Management

This repository compiles resources for data science process management.

It is challenging to establish a good process around data science tasks. Many data science tasks end up as one-off solutions that cannot be reused in other tasks. Data science tasks don't fit neatly into either software engineering or product engineering processes, and different data science tasks require different processes.

Yet it is crucial to have a well-defined process because, without one, organizations cannot accumulate knowledge internally and have to rely on individuals to perform critical tasks. People come and go, while process sticks around.

While there is no good standard so far, there have been active conversations around the issue.

By compiling those conversations, we hope to get closer to best practices in data science process management.

Understand Data Science Tasks

Why Managing Data Scientists Is Different by Roger M. Stein at MIT
Big Data – From Descriptive to Prescriptive by Cameron Cramer
The 4 Types of Data Analytics by Thomas Maydon at Principa
Closing the Insights-to-Action Gap by Gary Robinson at Lytix
Thoughts on Managing Data Science Team Workstreams by Harlan Harris

Data Analytics Process

What is the Workflow or Process of a Data Scientist by Ryan Fox Squire on Quora

Data Science Product Development Process

Use Cases

Meson: Workflow Orchestration for Netflix Recommendations at Netflix Technical Blog
Meet Michelangelo: Uber’s Machine Learning Platform at Uber Engineering Blog
Bighead: Airbnb’s End-to-End Machine Learning Platform by Krishna Puttaswamy and Nick Handel
Data science at eHarmony: A generalized framework for personalization by Jonathan Morra at Strata 2016
Continuous Delivery for Machine Learning - Automating the end-to-end lifecycle of ML applications by Danilo Sato, Arif Wider, and Christoph Windheuser (Sep. 2019)

Articles

Meet Michelangelo: Uber’s Machine Learning Platform (blog post); also see scientific paper and talk video
Data Science != Software Engineering by Domino Data Lab
The Practical Guide to Managing Data Science at Scale by Domino Data Lab
Rules of Machine Learning: Best Practices for ML Engineering by Martin Zinkevich at Google
Distributed Time Travel for Feature Generation by Hossein Taghavi, Prasanna Padmanabhan, DB Tsai, Faisal Zakaria Siddiqi, and Justin Basilico at Netflix
What's Your ML Test Score? A Rubric for ML Production Systems by Eric Breck et al. at Reliable Machine Learning in the Wild - NIPS 2016 Workshop
Machine Learning: The High Interest Credit Card of Technical Debt by D. Sculley et al. at SE4ML - NIPS 2014 Workshop
Machine Learning in Production by Szilard Pafka at Epoch (Video at Data Science LA meetup)
Insights from a Predictive Model Pipeline Abstraction by Harlan Harris
Moving Towards Managing AI Products - 10 Lessons for Building AI-Driven Products by Prasad Velamuri

Videos

Non-Flink Machine Learning on Flink by Ted Dunning at FlinkForward SF 2017
- Work on things that give most return - better data and better questions
- Use the decoy (dummy) and canary (baseline/reference) servers
- Use the containerized (modular) framework
Artificial Intelligence in the Software Engineering Workflow by Peter Norvig at O'Reilly Artificial Intelligence Conference 2017 (subscription required)
Machine Learning, Technical Debt, and You by D. Sculley at PAPIs 2017
Webinar: Managing the Complete Machine Learning Lifecycle by Andy Konwinski (Databricks)

Presentation Slides

Production and Beyond by Rajat Arya and Alice Zheng at Turi
10 More Lessons Learned from Building Real-Life Machine Learning Systems by Xavier Amatriain at Dr.Assist
Standardizing the Machine Learning Lifecycle

Data Science Knowledge Sharing Process

Scaling Knowledge at Airbnb by Airbnb

Tools

Data Analytics Process Templates

CRISP-DM
- Four Problems in Using CRISP-DM and How To Fix Them by James Taylor on KDnuggets
SEMMA by SAS
Team Data Science Process by Microsoft
Cookiecutter Data Science Template by DrivenData

Knowledge Sharing

knowledge-repo by Airbnb

Data/Modeling/Deployment Pipelines

Pachyderm - Version control of data and language-agnostic data pipelines
Kubernetes - Automating deployment, scaling, and management of containerized applications
AWS Data Pipeline by Amazon - Data pipeline management on AWS
Azure Data Factory by Microsoft - Data pipeline management on Azure
Airflow by Airbnb
Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Luigi by Spotify
MLflow by Databricks - An open source platform for managing the end-to-end machine learning lifecycle.
- Infrastructure for the Complete ML Lifecycle by Matei Zaharia at Spark+AI Summit 2018
Kubeflow - An open-source model training, deployment, and serving tool leveraging the Kubernetes ecosystem.
Polyaxon - Another open-source model training, deployment, and serving tool.
Algorithmia - A SaaS solution for managing the complete ML workflow cycle.
Lore - Framework to structure ML projects and pipelines

Model Management

Steam by H2O - Model management on H2O
Studio by studio.ml
Azure Model Management by Microsoft
dotscience

Open source:

ModelDB by MIT DB Group - Supports Spark ML and Scikit-Learn (Paper)
DVC: Version Control System for Machine Learning Projects
