A vocabulary collection for SREs

SRE-cheat-sheet

A cheatsheet for SREs (mostly influenced by Google SREs). It is meant as a landing page for quickly looking up a keyword. If you want to go into more detail, I suggest you read the Google SRE book. It is free: https://landing.google.com/sre/book/

Dictionary

Site Reliability Engineering

"Fundamentally, it’s what happens when you ask a software engineer to design an operations function" -- Ben Treynor, VP of Engineering @ Google.1

Uptime

Availability %   Downtime per year   Downtime per month   Downtime per week
90%              36.5 days           72 hours             16.8 hours
95%              18.25 days          36 hours             8.4 hours
98%              7.30 days           14.4 hours           3.36 hours
99%              3.65 days           7.20 hours           1.68 hours
99.5%            1.83 days           3.60 hours           50.4 minutes
99.8%            17.52 hours         86.4 minutes         20.16 minutes
99.9%            8.76 hours          43.2 minutes         10.1 minutes
99.95%           4.38 hours          21.6 minutes         5.04 minutes
99.99%           52.6 minutes        4.32 minutes         1.01 minutes
99.999%          5.26 minutes        25.9 seconds         6.05 seconds
99.9999%         31.5 seconds        2.59 seconds         0.605 seconds

Downtime per month is calculated at 30 days.2
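
The values in the table follow directly from the availability percentage. A minimal Python sketch, assuming a 365-day year, a 30-day month and a 7-day week; the name `allowed_downtime` is made up for this example:

```python
from datetime import timedelta

# Period lengths assumed for this sketch: 365-day year, 30-day month, 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent):
    """Return the maximum downtime per period for a given availability."""
    unavailable = 1 - availability_percent / 100
    return {name: period * unavailable for name, period in PERIODS.items()}


for name, downtime in allowed_downtime(99.9).items():
    print(f"{name}: {downtime.total_seconds() / 60:.1f} minutes")
```

Changing the entries in `PERIODS` is enough to recompute the table for a different month length.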

Error Budget

An error budget is the amount of unreliability you are allowed to spend, typically on pushing new features. Say your application or service has an availability target of 90%. You can then afford 36.5 days of downtime per year, or 72 hours per month. You can spend this budget on outages caused by errors, or you can build a reliable system and spend it on shipping new features. It is entirely up to you. Just make sure that once the budget is used up, you freeze new features until it has recovered. This has several advantages:

  1. Your software engineers will try to make the application as stable as possible, because an unstable application uses up the error budget on failures instead of new features.
  2. If your application is stable, you are free to push new features as far as your error budget allows.
  3. Your uptime will be consistent with your SLA. Nobody wants to get sued, trust me.
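
The budget arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names, the 30-day window and the rule "freeze when nothing is left" are assumptions, not a prescribed policy.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Total downtime budget, in minutes, for the given window."""
    return window_days * 24 * 60 * (1 - slo_percent / 100)


def remaining_budget_minutes(slo_percent, downtime_minutes, window_days=30):
    """Budget left after subtracting the downtime already consumed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def can_ship_features(slo_percent, downtime_minutes, window_days=30):
    """Assumed policy: feature pushes are allowed while budget remains."""
    return remaining_budget_minutes(slo_percent, downtime_minutes, window_days) > 0


# A 99.9% target over 30 days gives roughly 43.2 minutes of budget.
print(round(error_budget_minutes(99.9), 1))
print(can_ship_features(99.9, downtime_minutes=30))  # budget left
print(can_ship_features(99.9, downtime_minutes=50))  # budget exhausted
```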

Dickerson's Hierarchy of Service Reliability

Four Golden Signals

Four basic metrics that the monitoring of any service should cover: saturation, latency, errors, and traffic.5

Saturation

How you define saturation is up to you. It can be a measure of the service's capacity, such as CPU utilization. Ask yourself at what point your service would fall over, and measure the metric that tells you how close you are to that point.

Latency

Users expect blazing fast apps these days, so you definitely want to monitor your latency. At Google, latency is measured with three numbers:

  • P50: The 50th percentile or the median latency.
  • P90: The 90th percentile.
  • P99: The 99th percentile.

Do not treat latency as an average: a mean hides the slow tail that the higher percentiles expose.5
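
To see why, here is a small Python sketch that computes P50, P90 and P99 with the nearest-rank method and compares them with the mean. The sample latencies and the helper name `percentile` are made up; real monitoring systems usually work on histograms rather than raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten request latencies in milliseconds, one of them a slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

print("mean:", sum(latencies_ms) / len(latencies_ms))
for p in (50, 90, 99):
    print(f"P{p}:", percentile(latencies_ms, p))
```

With these samples the mean is about 102 ms, a value no request actually experienced, while the median is 14 ms and P99 is 900 ms.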

Errors

An indicator of failures while serving your traffic. Usually measured in EPS (Errors Per Second).

Traffic

Normally measured in RPS (Requests Per Second) or QPS (Queries Per Second).
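
Both rates are plain counts divided by the length of the measurement window. A minimal Python sketch with made-up counter values; `per_second` is a hypothetical helper:

```python
def per_second(count, window_seconds):
    """Convert a count observed over a window into a per-second rate."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    return count / window_seconds


# Hypothetical counters collected over a 60-second window.
requests, errors, window_seconds = 18_000, 90, 60

rps = per_second(requests, window_seconds)
eps = per_second(errors, window_seconds)
error_ratio = errors / requests

print(f"RPS={rps}, EPS={eps}, error ratio={error_ratio:.2%}")
```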

Valid Monitoring Output

Alerts

An alert is raised when a human needs to take action immediately to prevent a system crash or a degradation of your service.

Tickets

A ticket covers everything where a human needs to take action, but not right now. You should usually have enough time to fix the issue; if you do not, it is an alert.

Logs

Logs are only for diagnostic and forensic purposes and for postmortems.
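
The three output types can be read as a simple decision rule. The following Python sketch is one possible way to write it down; the enum and the two boolean inputs are assumptions made for illustration:

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"
    TICKET = "ticket"
    LOG = "log"


def classify(needs_human, urgent):
    """Map a monitoring event to alert, ticket or log."""
    if not needs_human:
        return Output.LOG      # kept for diagnostics and postmortems only
    if urgent:
        return Output.ALERT    # a human must act immediately
    return Output.TICKET       # a human must act, but not right now


print(classify(needs_human=True, urgent=True))
print(classify(needs_human=True, urgent=False))
print(classify(needs_human=False, urgent=False))
```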

Defense In Depth

Failures will always happen. Get used to it. You cannot prevent them, but you can tolerate them and have them fixed automatically. If you design your system so that it tolerates point failures, you already have one problem less.1

Graceful Degradation

"Graceful degradation is the ability to tolerate failures without having complete collapse. For example, if a user's network is running slowly, the Hangout video system will reduce the video resolution and preserve the audio. For Gmail, a slow network might mean that big attachments won't load, but users can still read their email. All these are automated responses that give you high availability without a human having to do anything."1

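As a rough illustration of the video example in the quote, the following Python sketch picks a media mode from the measured bandwidth. The function name and the bandwidth thresholds are invented for this example and are not taken from any Google system.

```python
def choose_media_mode(bandwidth_kbps):
    """Degrade step by step instead of failing outright."""
    if bandwidth_kbps >= 1500:   # assumed threshold for full-quality video
        return "video-high"
    if bandwidth_kbps >= 300:    # assumed threshold for reduced resolution
        return "video-low"
    return "audio-only"          # keep the call alive without video


for kbps in (4000, 800, 100):
    print(kbps, "->", choose_media_mode(kbps))
```

The point is that every branch still returns something usable, and no human has to step in.
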
Wheel Of Misfortune

The "Wheel Of Misfortune" is a role-playing game in which a previous postmortem is reenacted with a cast of engineers playing the roles laid out in the postmortem.6

Mean Time To Recover (MTTR)

MTTR is the average time that a device will take to recover from any failure.3

Mean Time Between Failures (MTBF)

MTBF is the predicted elapsed time between inherent failures of a mechanical or electronic system, during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems, while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system.4

Mean Time To Failure (MTTF)

MTTF denotes the expected time to failure for a non-repairable system.4
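
For a repairable system, MTTR and MTBF can be estimated from a list of outages. The Python sketch below assumes outages are given as (start, end) pairs in hours within a fixed observation window, and it uses the common steady-state formula availability = MTBF / (MTBF + MTTR), which is not taken from the sources above. All names and numbers are made up.

```python
def mttr(outages):
    """Mean time to recover: average outage duration in hours."""
    return sum(end - start for start, end in outages) / len(outages)


def mtbf(outages, window_hours):
    """Mean time between failures: total uptime divided by the number of failures."""
    downtime = sum(end - start for start, end in outages)
    return (window_hours - downtime) / len(outages)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability derived from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Three hypothetical outages within a 30-day (720-hour) window.
outages = [(100, 102), (400, 401), (650, 653)]

print("MTTR:", mttr(outages), "hours")
print("MTBF:", mtbf(outages, 720), "hours")
print(f"availability: {availability(mtbf(outages, 720), mttr(outages)):.4f}")
```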

Service Level Indicator (SLI)

SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided.5

Service Level Objective (SLO)

SLO is a target value or range of values for a service level that is measured by an SLI.5

Service Level Agreement (SLA)

SLA is a (legal) agreement with repercussions for failing to meet it.5
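
The relationship between an SLI and an SLO can be shown with a small Python sketch. It assumes an availability SLI defined as the share of successful requests; the request counts and targets are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("need at least one request")
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """SLO check: the measured SLI must reach the target value."""
    return sli >= slo_target


sli = availability_sli(good_requests=999_532, total_requests=1_000_000)

print(f"SLI: {sli:.4%}")
print("meets 99.9% SLO: ", meets_slo(sli, 0.999))
print("meets 99.99% SLO:", meets_slo(sli, 0.9999))
```

An SLA would then attach consequences to missing such a target, which is why SLOs are usually set stricter than the SLA.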

Capability Maturity Model (CMM)

CMM is a development model created after a study of data collected from organizations that contracted with the U.S. Department of Defense, who funded the research. The term "maturity" relates to the degree of formality and optimization of processes, from ad hoc practices, to formally defined steps, to managed result metrics, to active optimization of the processes.7

Postmortem

Sources

1 Google SRE Interview, Niall Murphy and Ben Treynor, "What is 'Site Reliability Engineering'?", 2018-09-26, https://landing.google.com/sre/interview/ben-treynor.html
2 https://interworks.com/blog/rclapp/2010/05/06/what-does-availabilityuptime-mean-real-world/
3 https://en.wikipedia.org/wiki/Mean_time_to_recovery
4 https://en.wikipedia.org/wiki/Mean_time_between_failures
5 Google Cloud Next 2018: Nori and Dan, "Best Practices from Google SRE", 2018-07-26, https://www.youtube.com/watch?v=XPtoEjqJexs
6 "Postmortem Culture: Learning from Failure", John Lunney and Sue Lueder, 2018-09-26, https://landing.google.com/sre/book/chapters/postmortem-culture.html
7 https://en.wikipedia.org/wiki/Capability_Maturity_Model

Additional Links

A more technical cheatsheet: https://github.com/michael-kehoe/awesome-sre-cheatsheets
DevOps: "Where do I start? Cheatsheet" by Microsoft: https://blogs.technet.microsoft.com/juliens/2016/02/14/devops-where-do-i-start-cheat-sheet/
"So you want to be an SRE" by hackernoon.com: https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c
The Google SRE Landing page: https://google.com/sre
