Distributed Machine Learning Patterns
This repository contains references and code for the book Distributed Machine Learning Patterns from Manning Publications by Yuan Tang.
In Distributed Machine Learning Patterns you will learn how to:
- Apply patterns to build scalable and reliable machine learning systems.
- Construct machine learning pipelines with data ingestion, distributed training, model serving, and more.
- Automate machine learning tasks with Kubernetes, TensorFlow, Kubeflow, and Argo Workflows.
- Make trade-off decisions between different patterns and approaches.
- Manage and monitor machine learning workloads at scale.
This book teaches you how to take machine learning models from your personal laptop to large distributed clusters. You’ll explore key concepts and patterns behind successful distributed machine learning systems, and learn technologies like TensorFlow, Kubernetes, Kubeflow, and Argo Workflows directly from a key maintainer and contributor. Real-world scenarios, hands-on projects, and clear, practical DevOps techniques let you easily launch, manage, and monitor cloud-native distributed machine learning pipelines.
About the topic
Scaling up models from personal devices to large distributed clusters is one of the biggest challenges faced by modern machine learning practitioners. Distributed machine learning systems allow developers to handle extremely large datasets across multiple clusters, take advantage of automation tools, and benefit from hardware acceleration. In this book, Yuan Tang shares patterns, techniques, and experience gained from years spent building and managing cutting-edge distributed machine learning infrastructure.
About the book
Distributed Machine Learning Patterns is filled with practical patterns for running machine learning systems on distributed Kubernetes clusters in the cloud. Each pattern is designed to help solve common challenges faced when building distributed machine learning systems, including supporting distributed model training, handling unexpected failures, and coping with dynamic model serving traffic. Real-world scenarios provide clear examples of how to apply each pattern, alongside the potential trade-offs for each approach. Once you’ve mastered these cutting-edge techniques, you’ll put them all into practice and finish up by building a comprehensive distributed machine learning system.
About the reader
For data analysts, data scientists, and software engineers familiar with the basics of machine learning algorithms and running machine learning in production. Readers should be familiar with the basics of Bash, Python, and Docker.
About the author
Yuan is a founding engineer at Akuity, building an enterprise-ready platform for developers. He has previously led data science and engineering teams at Alibaba and Uptake, focusing on AI infrastructure and AutoML platforms. He's a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and the author of numerous open source projects. In addition, Yuan has authored three machine learning books and several impactful publications. He's a regular speaker at various conferences and a technical advisor, leader, and mentor at various organizations.
Supporting Quotes
"This is a wonderful book for those wanting to understand how to be more effective with Machine Learning at scale, explained clearly and from first principles!" -- Laurence Moroney, AI Developer Relations Lead at Google
"This book is an exceptionally timely and comprehensive guide to developing, running, and managing machine learning systems in a distributed environment. It covers essential topics such as data partitioning, ingestion, model training, serving, and workflow management. What truly sets this book apart is its discussion of these topics from a pattern perspective, accompanied by real-world examples and widely adopted systems like Kubernetes, Kubeflow, and Argo. I highly recommend it!" -- Yuan Chen, Principal Software Engineer at Apple
"This book provides a high-level understanding of patterns with practical code examples needed for all MLOps engineering tasks. This is a must-read for anyone in the field." -- Brian Ray, Global Head of Data Science and Artificial Intelligence at Eviden
"This book weaves together concepts from distributed systems, machine learning, and site reliability engineering in a way that’s approachable for beginners and that’ll excite and inspire experienced practitioners. As soon as I finished reading, I was ready to start building." -- James Lamb, Staff Data Engineer at SpotHero
"Whatever your role is in the data ecosystem (scientist, analyst, or engineer), if you are looking to take your knowledge and skills to the next level, then this book is for you. This book is an amazing guide to the concepts and state-of-the-art when it comes to designing resilient and scalable, ML systems for both training and serving models. Regardless of what platform you may be working with, this book teaches you the patterns you should be familiar with when trying to scale out your systems." -- Ryan Russon, Senior Manager of Model Training at Capital One