How They SRE
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Introduction
How They SRE is a curated knowledge repository of the SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.
Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. The content in this repository is curated from those sources.
Note to readers: This list includes some articles, posts, videos, tools, and techniques published before 2015. Please use such material with caution, as more recent advances in technology and practice may offer better alternatives and perspectives.
Topics
- Site Reliability Engineering
- Hiring and Building SRE teams
- SRE Culture
- DevOps
- Monitoring & Observability
- Alerting
- Incident Response & Post-Mortem
- On-Call
- Testing in Production
- Chaos Engineering
- Automation
- Performance
Organizations
Achievers
Blog Posts
Airbnb
Blog Posts
Alibaba Cloud
Blog Posts
Asana
Blog Posts
ASOS
Blog Posts
- Playing the blame-less game
- A day in the life of… Cat S (Head of Reliability Engineering)
- An AKS Performance Journey: Part 1 — Sizing Everything Up
- An AKS Performance Journey: Part 2 — Networking It Out
- Cyber Security @ ASOS.com
- Security Operations 24x7
- The skills we look for in Cyber Security Incident Response
Atlassian
Blog Posts
Baidu
Videos
Basecamp
Blog Posts
- Inside a CODE RED: Network Edition
- Three Basecamp outages. One week. What happened?
- Basecamp 2 and Basecamp 3 search outage report
- Reducing Incident Escalations at Basecamp
Books
Bloomberg
Videos
- Capacity Planning and Performance Enhancement with Page Reference Sampling
- Why SREs can't afford to NOT do Chaos Engineering
- Tracing Real-Time Distributed Systems
- The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation
- Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest
Booking.com
Blog Posts
- How Reliability and Product Teams Collaborate at Booking.com
- Incidents, fixes, and the day after
- Troubleshooting: A journey into the unknown
Videos
Capital One
Blog Posts
- Automate Application Monitoring with Slack
- Automate AWS Infrastructure with Boto 3: AWS Health Check
- Active-Active Shared-Nothing Database Architecture
- The 3 R’s of SREs: Resiliency, Recovery & Reliability
- 5 Steps to Getting Your App Chaos Ready
- 4 Real-World Scenarios That Read Like Chaos Engineering Experiments
- Embrace the Chaos … Engineering
- 3 Lessons Learned From Implementing Chaos Engineering at Enterprise
- A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy
- Secure Docker Containers Require Secure Applications
- 4 Steps for Pairing the Cloud and DevOps to Improve Resiliency
- Container Ready Applications with Twelve-Factor App and Microservices Architecture
- Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS
- Architecting for Resiliency
- Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
- The Mon-ifesto Part 1: Metrics
Major incidents & analysis reports
Videos
DBS
Blog Posts
- Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far
- Debunking the seven most popular Site Reliability Engineering myths
- How To Use SRE To Cultivate A Blameless Culture In The Workplace
- Site Reliability Engineering at DBS Bank
- Automating Configuration Management at Scale
- How DBS dispelled the myths of Chaos Engineering
- Double, Double Toil and Trouble
Videos
DeepSource
Blog Posts
Dream11
Blog Posts
- Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.
- Enhancing security and trust with AWS WAFv2
- Lessons learned from running GraphQL at scale
- Break circuits, save Kong 🦍
- Finding Order in Chaos: How We Automated Performance Testing with Torque
- Maintaining hyper-sonic releases at Dream11
- To Scale In Or Scale Out? Here’s How We Scale at Dream11
- Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11
Dropbox
Blog Posts
- Dropbox Engineering Career Framework - Reliability Engineer (SRE)
- Atlas: Our journey from a Python monolith to a managed platform
- Monitoring server applications with Vortex
- Athena: Our automated build health management system
Videos
eBay
Blog Posts
- Resiliency and Disaster Recovery with Kafka
- SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
- SRE Case Study: Mysterious Traffic Imbalance
- Zero Downtime, Instant Deployment and Rollback
Videos
Etsy
Blog Posts
- Improving the Deployment Experience of a Ten-Year Old Application
- How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
- Your brain on progress
- Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
- Opsweekly: Measuring on-call experience with alert classification
- Demystifying Site Outages
- Blameless PostMortems and a Just Culture
- Measure Anything, Measure Everything
Videos
Expedia
Blog Posts
- Automating Performance Standards
- Error Budget Policy - Part 1 - Adoption at Expedia Group
- Error Budget Policy - Part 2 - Practices at Expedia Group
- Using Fault-Injection to Improve our new Runtime Platform’s Reliability
- Learning from Incidents at Expedia Group
- Improving Vrbo Homepage Loading Experience
- Troubleshooting 502 errors: ECS Checklist
- Getting Started with Elasticsearch
- All about ISTIO-PROXY 5xx Issues
- Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?
- How to Keep Your Kubernetes Deployments Balanced Across Multiple zones
- Are Your Dropwizard Latency Metrics Misleading You?
- The Cost of 100% Reliability
- Creating Monitoring Dashboards
- Using Bash for DevOps
Fastly
Videos
Getaround
Blog Posts
GitHub
Blog Posts
Major incidents & analysis reports
- GitHub Availability Report: December 2021
- GitHub Availability Report: November 2021
- GitHub Availability Report: October 2021
- GitHub Availability Report: September 2021
- GitHub Availability Report: August 2021
- GitHub Availability Report: July 2021
- GitHub Availability Report: June 2021
- GitHub Availability Report: May 2021
- GitHub Availability Report: April 2021
- GitHub Availability Report: March 2021
- GitHub Availability Report: February 2021
- GitHub Availability Report: January 2021
- GitHub Availability Report: December 2020
- GitHub Availability Report: November 2020
- GitHub Availability Report: August 2020
- GitHub Availability Report: July 2020
- Introducing the GitHub Availability Report
- February service disruptions post-incident analysis
- October 21 post-incident analysis
- February 28th DDoS Incident Report
- Incident Report: Inadvertent Private Repository Disclosure
Videos
GitLab
Blog Posts
- This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...
- My week shadowing a GitLab Site Reliability Engineer
- Update: Elasticsearch lessons learnt for Advanced Global Search
- Lessons in iteration from a new team in infrastructure
- How we optimized infrastructure spend at GitLab
- How we scaled async workload processing at GitLab.com using Sidekiq
- Inside GitLab: How we release software patches
- What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab
- How we used delayed replication for disaster recovery with PostgreSQL
GoCardless
Blog Posts
- Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial
- How we compress Pub/Sub messages and more, saving a load of money
- Fear-free PostgreSQL migrations for Rails
- Observability at GoCardless: a tale of API performance improvement
- Debugging the PostgreSQL query planner
- Zero-downtime Postgres migrations - the hard parts
- In search of performance - how we shaved 200ms off every POST request
Major incidents & analysis reports
GoDaddy
Blog Posts
Gojek
Blog Posts
Goldman Sachs
Blog Posts
Google
Blog Posts
- Pitfalls and Patterns in Microservice Dependency Management
- SRE Practices & Processes
- Google site reliability using Go
- Three months, 30x demand: How we scaled Google Meet during COVID-19
- SRE Classroom: Distributed PubSub
- How SRE teams are organized, and how to get started
Videos
- What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google
- ‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google
- ‘Pragmatic Automation’ with Max Luebbe of GCP
- Must Watch! - Google SRE YouTube Playlist
- Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
- Implementing Distributed Consensus
- The SRE I Aspire to Be
- SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
- Zero Touch Prod: Towards Safer and More Secure Production Environments
- All of Our ML Ideas Are Bad (and We Should Feel Bad)
- The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
- Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program
- Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way
- Practical Instrumentation for Observability
- What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services
- Unified Reporting of Service Reliability
- How to Trade off Server Utilization and Tail Latency
- Keeping the Balance: Internet-Scale Loadbalancing Demystified
- From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services
- Mindfulness in SRE: Monitoring and Alerting for One's Self
- Pragmatic Automation
- Sublinear Scaling in Practice: The 1k SRE Project
- Strategies to Edit Production Data
- The Curse of SRE Autonomy and How to Manage It
- Scaling SRE Organizations: The Journey from 1 to Many Teams
- SRE Classroom - How to Design a Distributed System in 3 Hours
- Using PRDs and User Journeys to Design User-Friendly Tools
- How Google SRE and Developers Work Together
- SREcon21 - Experiments for SRE
Grab
Blog Posts
- Our Journey to Continuous Delivery at Grab (Part 1)
- Our Journey to Continuous Delivery at Grab (Part 2)
- Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)
- Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)
- Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering
- Orchestrating Chaos using Grab's Experimentation Platform
- How We Designed the Quotas Microservice to Prevent Resource Abuse
- How We Scaled Our Cache and Got a Good Night's Sleep
Grammarly
Blog Posts
Heroku
Blog Posts
Indeed
Blog Posts
- Indeed SRE: An Inside Look
- Being Just Reliable Enough
- Automating Indeed’s Release Process
- ‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com
Videos
Khan Academy
Blog Posts
LinkedIn
Blog Posts
- Rethinking site capacity projections with Capacity Analyzer
- Insights into a Product SRE team at LinkedIn
- Hiring SREs at LinkedIn
- Open source update: School of SRE
- Fixing Linux filesystem performance regressions
- Production testing with dark canaries
- Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform
- Iris mobile: An open source, mobile interface for incident management
- LinkedOut: A Request-Level Failure Injection Framework
- Eliminating toil with fully automated load testing
- The Makeup of Successful Geographically-Distributed SRE Teams: Part 1
- The Makeup of Successful Geographically-Distributed SRE Teams: Part 2
- Project STAR*: Streamlining Our On-Call Process
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Resilience Engineering at LinkedIn with Project Waterbear
- Hiring SREs at LinkedIn, 2017
- Open Sourcing Iris and Oncall
- Building the SRE Culture at LinkedIn
- Failure is Not an Option
- MTTD and MTTR Are Key
- What Gets Measured Gets Fixed
Videos
- Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler
- 9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
- Weathering the Storm: How Early Warnings Save the Farm
- Unconference: Unsolved Problems in SRE
- Leading without Managing: Becoming an SRE Technical Leader
- Why Does (My) Monitoring Suck?
- Traffic Forecasting and Stress Testing Infrastructure
- Collective Mindfulness for Better Decisions in SRE
- TCP—Architecture, Enhancements, and Tuning
- Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
- Understanding Business Metrics Can Make You a Better SRE
- Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
- Differences in SRE Implementations across Companies
Tools
Loveholidays
Blog Posts
- Dynamic alert routing with Prometheus and Alertmanager
- Making loveholidays 18% faster with HTTP/3
- Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code
- The 5 principles that helped scale loveholidays
- Realtime Fastly logs with Grafana Loki for under $1 a day
Macquarie
Blog Posts
Mattermost
Blog Posts
Mercari
Blog Posts
- Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems
- What the Microservices SRE Team are doing as SRE Evangelists
- What it’s like to work as an embedded microservices SRE
- The Merpay SRE Team: Past and future
- Embedded SRE at Mercari
- What the SRE team wants to achieve with the development team
- DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?
- How do we share troubleshooting skills
- Datadog Dashboard at Scale w/ Terraform
Meta
Blog Posts
- Improving Meta’s SLO workflows with data annotations
- SLICK: Adopting SLOs for improved reliability
- More details about the October 4 outage
- Update about the October 4th outage
Videos
Microsoft
Videos
- ‘SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft
- ‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft
- Sustainable Software Engineering & SREs
- Study on Human Factors and Team Culture to Improve Pager Fatigue
- Prioritizing Trust While Creating Applications
- Building Resilience: How to Learn More from Incidents
- A Tale of Two Postmortems: A Human Factors View
- Availability—Thinking beyond 9s
- Ironies of Automation: A Comedy in Three Parts
- The Ops in Serverless
MIRO
Blog Posts
Monzo
Blog Posts
- Autoscaling Monzo: How we optimise our platform to be just the right size
- How we’ve evolved on-call at Monzo
- How we respond to incidents
- How we monitor Monzo
Videos
Tools
Netflix
Blog Posts
- Achieving observability in async workflows
- Building Netflix’s Distributed Tracing Infrastructure
- Lessons from Building Observability Tools at Netflix
- Edgar: Solving Mysteries Faster with Observability
- Telltale: Netflix Application Monitoring Simplified
- Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix
- Introducing Dispatch
- Applying Netflix DevOps Patterns to Windows
- ChAP: Chaos Automation Platform
- Starting the Avalanche
- Netflix Chaos Monkey Upgraded
- Chaos Engineering Upgraded
- Automated Failure Testing
- From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
- Introducing Atlas: Netflix’s Primary Telemetry Platform
- FIT: Failure Injection Testing
- Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis
- Lessons Netflix Learned from the AWS Outage
- Scryer: Netflix’s Predictive Auto Scaling Engine
Major incidents & analysis reports
Videos
- AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)
- When /bin/sh Attacks: Revisiting "Automate All the Things"
- How Did Things Go Right? Learning More from Incidents
- Monitoring and Tracing @Netflix Streaming Data Infrastructure
- Real user performance monitoring at Netflix scale ‐ Martin Spier
- AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is
- AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)
- Netflix: Multi-Regional Resiliency and Amazon Route 53
- Designing Services for Resilience: Netflix Lessons
- South Bay SRE Meetup - Netflix Cloud Performance Team
- AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)
- How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows
- Mastering Chaos - A Netflix Guide to Microservices
- AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)
- SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs
- From Sys Admin to Netflix SRE
- Application Resilience Engineering and Operations at Netflix with Hystrix
- Injecting Failure at Netflix
- LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability
- Incident Management at Netflix Velocity
Podcasts
Tools
New Relic
Blog Posts
- Defining Modern Software Roles: SREs at New Relic
- 10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)
- What Tools Do Site Reliability Engineers Use?
- A Day in the Life of a New Relic SRE
- 7 Habits of Highly Successful Site Reliability Engineers
- Adopting the practice of SRE
- Using modern observability to establish a data-driven culture
OpenAI
Blog Posts
PayPal
Blog Posts
- Triggered: Incident #1234 (incident process needs fixing)
- Implementing Observability in a Service Mesh
- PostgreSQL at Scale: Database Schema Changes Without Downtime
- Scaling GraphQL at PayPal
Videos
- SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
- SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal
- Detecting Service Degradation and Failures at Scale through Distributed Log Processing
- Operating Elasticsearch with Ease at Scale
- Ensuring Site Reliability through Security Controls
Picnic
Blog Posts
Pinterest
Blog Posts
- Ensuring High Availability of Ads Realtime Streaming Services
- Improving efficiency and reducing runtime using S3 read optimization
- Scaling Kubernetes with Assurance at Pinterest
- What we learned from an iOS app OOMs incident
- How we designed our Continuous Integration System to be more than 50% Faster
- Simplifying web deploys
- Upgrading Pinterest operational metrics
- Distributed tracing at Pinterest with new open source tools
- Auto scaling Pinterest
Videos
Prezi
Blog Posts
Red Hat
Blog Posts
Riot Games
Blog Posts
- THE LEGENDS OF RUNETERRA CI/CD PIPELINE
- STRATEGIES FOR WORKING IN UNCERTAIN SYSTEMS
- IMPROVING THE DEVELOPER EXPERIENCE FOR OPERATING SERVICES
- SCALABILITY AND LOAD TESTING FOR VALORANT
- LEVERAGING GOLANG FOR GAME DEVELOPMENT AND OPERATIONS
- CONTROLLED CHAOS WITH FAULT INJECTION TESTING
- DOWN THE RABBIT HOLE OF PERFORMANCE MONITORING
- PROFILING: THE CASE OF THE MISSING MILLISECONDS
- PROFILING: REAL WORLD PERFORMANCE IN LEAGUE
- PROFILING: OPTIMISATION
- PROFILING: MEASUREMENT AND ANALYSIS
- RUNNING ONLINE SERVICES AT RIOT: PART I
- RUNNING ONLINE SERVICES AT RIOT: PART II
- RUNNING ONLINE SERVICES AT RIOT: PART III
- RUNNING ONLINE SERVICES AT RIOT: PART III: PART DEUX
- RUNNING ONLINE SERVICES AT RIOT: PART IV
- RUNNING ONLINE SERVICES AT RIOT: PART V
- THE EVOLUTION OF SECURITY AT RIOT
- RUNNING AN AUTOMATED TEST PIPELINE FOR THE LEAGUE CLIENT UPDATE
- AUTOMATED TESTING FOR LEAGUE OF LEGENDS
Salesforce
Blog Posts
- Looking at the Kubernetes Control Plane for Multi-Tenancy
- Optimizing EKS networking for scale
- Zero Downtime Node Patching in a Kubernetes Cluster
- How, Not Why: An Alternative to the Five Whys for Post-Mortems
- A Generic Sidecar Injector for Kubernetes
- Implementation of a monitoring strategy for products based on microservices
- 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
- Our Journey to a Near Perfect Log Pipeline
- Optimizing Performance with Web Workers
- Take A Moment To Refocus
Scribd
Blog Posts
Shopify
Blog Posts
- Resiliency Planning for High-Traffic Events
- Capacity Planning at Scale
- Using DNS Traffic Management to Add Resiliency to Shopify’s Services
- Four Steps to Creating Effective Game Day Tests
- Implementing ChatOps into our Incident Management Procedure
- StatsD at Shopify
Videos
Sky Betting and Gaming
Blog Posts
Slack
Blog Posts
- Slack’s Incident on 2-22-22
- Infrastructure Observability for Changing the Spend Curve
- Slack’s Outage on January 4th 2021
- A Terrible, Horrible, No-Good, Very Bad Day at Slack
- Deploys at Slack
- Disasterpiece Theater: Slack’s process for approachable Chaos Engineering
Videos
Slalom Build
Blog Posts
- How to Implement Service Level Objectives in New Relic APM
- Beginners Guide to DevOps: How to Make It into the Industry
- GitHub Actions: Beyond CI/CD
- Why isn’t all test automation run on the pipeline?
- The Many Shapes of Site Reliability Engineering
- How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS
- Secret Management Architectures: Finding the balance between security and complexity
- Detecting Malicious Requests with Keras & Tensorflow
- The Lego Monolith — A Monolith Microservice Proof of Concept
- Managing Secrets Using Hashicorp Vault
- Packaging Spring Boot Applications for Deployment on Kubernetes
- Immutable Infrastructure and Continuous Delivery in the Cloud
Soundcloud
Blog Posts
Spotify
Blog Posts
- Matt Clarke: Senior Backend Infrastructure Engineer
- Designing a Better Kubernetes Experience for Developers
- Techbytes: What The Industry Misses About Incidents and What You Can Do
- Automated Incident Response Infrastructure in GCP
Videos
Stack Overflow
Blog Posts
- “This should never happen. If it does, call the developers.”
- Infrastructure as code: Create and configure infrastructure elements in seconds
- Fulfilling the promise of CI/CD
- A deeper dive into our May 2019 security incident
- Guest Post - Failing over without falling over
- How We Built Our Blog
- Stack Overflow Frees Up Engineering Time with Netlify
Videos
Strava
Blog Posts
Stripe
Blog Posts
- Fast and flexible observability with canonical log lines
- Fast builds, secure builds. Choose two.
- Introducing Veneur: high performance and global aggregation for Datadog
Videos
Target
Blog Posts
Tinder
Blog Posts
Tokopedia
Blog Posts
Twitter
Blog Posts
- Logging at Twitter: Updated
- Deleting data distributed throughout your microservices architecture
- Deterministic Aperture: A distributed, load balancing algorithm
- MetricsDB: TimeSeries Database for storing metrics at Twitter
- The Infrastructure Behind Twitter: Scale
- The infrastructure behind Twitter: efficiency and optimization
Uber
Blog Posts
- Founding Uber SRE
- Disaster Recovery for Multi-Region Kafka at Uber
- Engineering Failover Handling in Uber’s Mobile Networking Infrastructure
- Optimizing Observability with Jaeger, M3, and XYS at Uber
Videos
upGrad
Blog Posts
Wikimedia Foundation
Videos
Wix
Blog Posts
Zalando
Blog Posts
SREcon Mix Playlist
Videos
- Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE
- Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps
- Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
- Alaska Airlines - Capacity Prediction in External Services
- BuzzFeed - Optimizing for Learning
- BT - Challenges of Starting an SRE Team from Scratch in an Enterprise
- Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions
- Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken
- IBM - Why Automating Everything Adds to Your Toil
- Genesys - The Smallest Possible SRE Team
- G-Research - My Life as a Solo SRE
- Grafana Labs - SRE in the Third Age
- Kenna Security - Building a Scalable Monitoring System
- Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better
- MessageBird - Autopsy of a MySQL Automation Disaster
- Netlify - Perks and Pitfalls of Building a Remote First Team
- ReactiveOps - Zero to SRE
- Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19
- Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations
- The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events
- Twitter - Hiring Great SREs
- United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value
- Unity Technologies - Being Reasonable about SRE
- Udemy - How to Do SRE When You Have No SRE
- Vanguard - Cloudy with a Chance of Chaos
- WeWork - Learning from Learnings: Anatomy of Three Incidents
- Zendesk - Latency and Availability Error Budgets Done Right at Scale
Resources
Books
- New! Enterprise Roadmap to SRE
- Building Secure & Reliable Systems | Read free online version hosted by Google
- Site Reliability Engineering | Read free online version hosted by Google
- The Site Reliability Workbook from Google | Read free online version hosted by Google
- Training Site Reliability Engineers | Read free online version hosted by Google
- 97 Things Every SRE Should Know | Complimentary Copy from Nginx
- SLO Adoption and Usage in Site Reliability Engineering
- Practical Site Reliability Engineering
- Implementing Service Level Objectives
- Chaos Engineering
- Seeking SRE
- Security Chaos Engineering
- Chaos Engineering Observability
- Database Reliability Engineering
- What Is SRE?
- Database Reliability Engineering: What, Why, and How?
- Observability Engineering
- Chaos Engineering: Site reliability through controlled disruption
- Incident Metrics in SRE | Read free online version hosted by Google
- Engineering Reliable Mobile Applications
- Monitoring the SRE Golden Signals
- Site Reliability Engineering: Philosophies, habits, and tools for SRE success | Portable version
- 97 Things Every Cloud Engineer Should Know
- Real-World SRE
- Hands-on Site Reliability Engineering
Events
Other Resources
Awesome Lists
- Awesome SRE
- Awesome Site Reliability Engineering Tools
- Awesome Chaos Engineering
- Awesome Monitoring
- Awesome Observability
- Awesome MLOps
- ML-Ops.org
SRE Resources from various organizations
- Google SRE Page
- Google SRE Classroom
- Google Cloud SRE Page
- Microsoft SRE Page
- School of SRE from LinkedIn
- Stripe Increment Magazine Issue 16 on Reliability
- AWS Observability Recipes
- Awesome Sysadmin
Incidents & postmortems
- The Verica Open Incident Database
- Postmortem Templates
- Incident Review and Postmortem Best Practices
Newsletters
Credits
- Inspired by Howtheytest from Abhijeet Vaikar
- The list of organizations is drawn from my other repo, awesome-engineering
- Banner image: cartoon vector created by vectorjuice (www.freepik.com)
Other How They... repos
Contribute
Contributions welcome! Read the contribution guidelines first.
License
To the extent possible under law, Unmesh Gundecha has waived all copyright and related or neighboring rights to this work.
If you use this anywhere, please credit @upgundecha on Twitter. If you like my work, check out my other projects on GitHub.