Architectural patterns of resilient distributed systems
Accompanying repository for the "Architectural patterns of resilient distributed systems" talk given at Strange Loop 2015. Feel free to open an issue with questions or just to say hi :)
Talk Outline
See the image credits, link to slides, and video.
- Why Resilience
  - Motivation & Definitions
- Resilience Literature
  - Harvest/Yield thinking
  - Cook's Model
  - Borrill's Model
- Resilience in industry
  - Netflix
  - Fastly
- Conclusions
  - Back to the start
  - Parting thoughts and rantifestos
References
Resilience literature
- Baller checklist on things to remember
- Harvest, Yield, and Scalable Tolerant Systems
- Computer Immunology - Burgess
- Building Robust Systems an essay - Sussman
- How Complex Systems Fail - Cook
- Optimal Design, Robustness, and Risk Aversion
- Part Count and Design of Robust Systems
- Highly Optimized Tolerance: A Mechanism for Power Laws in Designed Systems
- Fault Tolerance and the Five-Second Rule
- Scale-free Networks - Computerworld
- The Scale-free property - Barabási
- Scale-free network
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Failure Sketches: A Better Way to Debug
- Virtual Network Diagnosis as a Service
- ‘Going solid’: a model of system dynamics and consequences for patient safety
- Building on Quicksand
- Immutability Changes Everything
- You can't sacrifice partition tolerance
- Complex adaptive system
- Robustness principle
- Small-world experiment
Resilience in industry
- Fault tolerance in a high-volume distributed system
- From Chaos to Control - Testing the resiliency of Netflix’s Content Discovery Platform
- Making the Netflix API More Resilient
- Google Finds: Centralized Control, Distributed Data Architectures Work Better Than Fully Decentralized Architectures
- Clients are Jerks: aka How Halo 4 DoSed the Services at Launch & How We Survived
- Game Day Exercises at Stripe: Learning from kill -9
- How we ended up with microservices
- Postmortem for July 27 outage of the Manta service
- HashiCorp Yamux
- The Chubby lock service for loosely-coupled distributed systems
- Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Media
- Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems"
- Developing a Globally Distributed Purging System and slides
- Complex Adaptive Systems: 13 Robustness & Resilience
- Network Theory: 16 Robustness & Resilience
- Design of Resilient Systems - Innovations in Thinking Differently
- Camille Fournier's Papers We Love Talk on The Chubby lock service for loosely-coupled distributed systems and slides
- Scaling Networks through Software
Thank you!
Thank you to everyone who helped with feedback, resources, and advice for this talk. Special thanks to: Paul Borrill, Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Matt Whiteley, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgar, Peter Alvaro, Peter Bailis, Alex Rasmussen, Bruce Spang, Aysylu Greenberg, Elaine Greenberg, and Greg Bako.