Data Science Process Management
This repository compiles resources for data science process management.
It is challenging to establish a good process around data science tasks. Many data science tasks end up with one-off solutions that cannot be reused in other tasks. Data science tasks do not fit well into either a software engineering or a product engineering process. In addition, different data science tasks require different processes.
Yet it is crucial to have a well-defined process: without one, organizations cannot accumulate knowledge internally and must rely on individuals to perform critical tasks. People come and go, while process sticks around.
While there is no good standard so far, there have been active conversations around the issue.
By compiling them here, we hope to get closer to best practices in data science process management.
Understand Data Science Tasks
- Why Managing Data Scientists Is Different by Roger M. Stein at MIT
- Big Data - From Descriptive to Prescriptive by Cameron Cramer
- The 4 Types of Data Analytics by Thomas Maydon, Principa
- Closing the Insights-to-Action Gap by Gary Robinson, Lytix
- Thoughts on Managing Data Science Team Workstreams by Harlan Harris
Data Analytics Process
- What is the Workflow or Process of a Data Scientist by Ryan Fox Squire on Quora
Data Science Product Development Process
Use Cases
- Meson: Workflow Orchestration for Netflix Recommendations at Netflix Technical Blog
- Meet Michelangelo: Uber's Machine Learning Platform at Uber Engineering Blog
- Bighead: Airbnb's End-to-End Machine Learning Platform by Krishna Puttaswamy and Nick Handel
- Data science at eHarmony: A generalized framework for personalization by Jonathan Morra at Strata 2016
- Continuous Delivery for Machine Learning - Automating the end-to-end lifecycle of ML applications by Danilo Sato, Arif Wider, and Christoph Windheuser (Sep. 2019)
Articles
- Meet Michelangelo: Uber's Machine Learning Platform (blog post); also see scientific paper and talk video
- Data Science != Software Engineering by Domino Data Lab
- The Practical Guide to Managing Data Science at Scale by Domino Data Lab
- Rules of Machine Learning: Best Practices for ML Engineering by Martin Zinkevich at Google
- Distributed Time Travel for Feature Generation by Hossein Taghavi, Prasanna Padmanabhan, DB Tsai, Faisal Zakaria Siddiqi, and Justin Basilico at Netflix
- What's Your ML Test Score? A Rubric for ML Production Systems by Eric Breck et al. at Reliable Machine Learning in the Wild - NIPS 2016 Workshop
- Machine Learning: The High Interest Credit Card of Technical Debt by D. Sculley et al. at SE4ML - NIPS 2014 Workshop
- Machine Learning in Production by Szilard Pafka at Epoch (Video at Data Science LA meetup)
- Insights from a Predictive Model Pipeline Abstraction by Harlan Harris
- Moving Towards Managing AI Products - 10 Lessons for Building AI-Driven Products by Prasad Velamuri
Videos
- Non-Flink Machine Learning on Flink by Ted Dunning at FlinkForward SF 2017
  - Work on the things that give the most return: better data and better questions
  - Use decoy (dummy) and canary (baseline/reference) servers
  - Use a containerized (modular) framework
- Artificial Intelligence in the Software Engineering Workflow by Peter Norvig at O'Reilly Artificial Intelligence Conference 2017 (subscription required)
- Machine Learning, Technical Debt, and You by D. Sculley at PAPIs 2017
- Webinar: Managing the Complete Machine Learning Lifecycle by Andy Konwinski (Databricks)
Presentation Slides
- Production and Beyond by Rajat Arya and Alice Zheng at Turi
- 10 More Lessons Learned from Building Real-Life Machine Learning Systems by Xavier Amatriain at Dr.Assist
- Standardizing the Machine Learning Lifecycle
Data Science Knowledge Sharing Process
- Scaling Knowledge at Airbnb by Airbnb
Tools
Data Analytics Process Templates
- CRISP-DM
  - Four Problems in Using CRISP-DM and How To Fix Them by James Taylor on KDnuggets
- SEMMA by SAS
- Team Data Science Process by Microsoft
- Cookiecutter Data Science Template by DrivenData
Knowledge Sharing
- knowledge-repo by Airbnb
Data/Modeling/Deployment Pipelines
- Pachyderm - Version control of data and language-agnostic data pipelines
- Kubernetes - Automating deployment, scaling, and management of containerized applications
- AWS Data Pipeline by Amazon - Data pipeline management on AWS
- Azure Data Factory by Microsoft - Data pipeline management on Azure
- Airflow by Airbnb
- Oryx - Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
- Luigi by Spotify
- MLflow by Databricks - an open source platform for managing the end-to-end machine learning lifecycle.
  - Infrastructure for the Complete ML Lifecycle by Matei Zaharia at Spark+AI Summit 2018
- Kubeflow - An open-source model training, deployment, and serving tool leveraging the Kubernetes ecosystem.
- Polyaxon - Another open-source model training, deployment, and serving tool.
- Algorithmia - A SaaS solution for managing the complete ML workflow cycle
- Lore - Framework to structure ML projects and pipelines
Model Management
- Steam by H2O - Model management on H2O
- Studio by studio.ml
- Azure Model Management by Microsoft
- dotscience
Open source:
- ModelDB by MIT DB Group - Supports Spark ML and Scikit-Learn (Paper)
- DVC: Version Control System for Machine Learning Projects