  • Stars: 11,476
  • Rank: 2,767 (Top 0.06%)
  • License: Creative Commons ...
  • Created about 8 years ago
  • Updated 6 months ago

Repository Details

A curated list of Site Reliability and Production Engineering resources.

Awesome Site Reliability Engineering

A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing

Please take a look at the contribution guidelines first. Contributions are always welcome!

Contents

Culture

Education

Books

Hiring

Reliability

Monitoring & Observability & Alerting

On-Call

Post-Mortem

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

Real-time Messaging

Blogs

  • Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
  • Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
  • High Scalability - Technical Blog Posts About Systems Architecture.
  • rachelbythebay - Technical Blog Posts.
  • Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
  • SysAdvent - One article for each day of December, ending on the 25th.
  • Stephen Thorne's Blog - Blog Posts About SRE.
  • Increment - A digital magazine about how teams build and operate software systems at scale.
  • GopherSRE - Blog Posts about Go and SRE.
  • Cindy Sridharan - Blog posts about distributed systems and their management.
  • Blameless Blog - Blog posts about SRE culture and practices.
  • Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems.
  • Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
  • FireHydrant Blog - Posts about complex systems, incident response, and SRE best practices.
  • Rootly Blog - Incident management best practices and guides.
  • incident.io Blog - Guides, advice and resources on incident management and response.
  • Logit.io Blog - Resources on log management, SRE and DevOps.

Newsletters

  • DevOpsLinks - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
  • KubeWeekly - The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.
  • SRE Weekly - Weekly Site Reliability Newsletter.
  • O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
  • ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!
  • Monitoring Weekly - What's new in monitoring? Curated monitoring articles to your inbox each week.
  • Observability news - Updates around observability (o11y) with a special focus on open source.

Conferences & Meetups

Twitter

SRE Tools

Podcasts

More Repositories

  • awesome-chaos-engineering - A curated list of Chaos Engineering resources. (5,779 stars)
  • postmortem-templates - A collection of postmortem templates. (1,208 stars)
  • wheel-of-misfortune - A role-playing game for incident management training. (HTML, 158 stars)
  • availability-calculator - Calculate how much downtime should be permitted in your Service Level Agreement or Objective. (HTML, 63 stars)
  • kubectl-janitor - List Kubernetes objects in a problematic state. (Go, 54 stars)
  • sreworkbook-templates-md - A collection of templates ported from the SRE Workbook. (35 stars)
  • vegeta-operator - Kubernetes Operator for running HTTP load testing scenarios with Vegeta. (Go, 32 stars)
  • common-disaster-recovery-scenarios - A list of common Disaster Recovery (DR) scenarios for software companies. (31 stars)
  • tc-panel - Geo-Distributed Infrastructure Emulation using Traffic Shaping. (Python, 12 stars)
  • ansible-rpi-cluster - Automate common tasks in your Raspberry Pi cluster with Ansible. (9 stars)
  • strgz - CLI tool for listing and searching users' starred repositories on GitHub. (Go, 7 stars)
  • tidyman - Script for tidying files. (Shell, 5 stars)
  • oomutil - A Go package with read-only operations for determining the Out-Of-Memory (OOM) status of a process on Linux. (Go, 5 stars)
  • error-budget-calculator - Calculate the tolerable downtime of your service. (HTML, 5 stars)
  • py-deterministic-subsetting - Deterministic Subsetting as defined in the SRE book. (Python, 4 stars)
  • tmux-load-avg - tmux plugin that displays the system load average in the last 1, 5 and 15 minutes. (Shell, 4 stars)
  • fstree - A tool that generates a depth-indented listing of files and sub-directories in a tree-like format. (Go, 3 stars)
  • golang-timemap - Time-based key-value store for Go. (Go, 3 stars)
  • proctree - A tool to display a tree of running processes. (Go, 3 stars)
  • gopheracademy-advent2019-tcp-no-delay - Material from my GopherAcademy Advent 2019 blog post about TCP_NODELAY and Go. (Go, 3 stars)
  • dastergon.github.io - Personal website. (HTML, 2 stars)
  • rawdog-list-authors - rawdog plugin to display authors list. (2 stars)
  • sremuc - Site Reliability Engineering Munich Meetup Page. (CSS, 1 star)
  • rawdog (HTML, 1 star)
  • venus2rawdog - Command-line utility to migrate config files from RSS aggregator venus to rawdog. (Python, 1 star)
  • sampler-recipes - A list of sampler recipes to run on the CLI. (1 star)