  • Stars: 129
  • Rank 279,262 (Top 6 %)
  • Language: Go
  • Created over 3 years ago
  • Updated about 1 year ago

Repository Details

SLOs, error windows, and alerts are complicated. Here's an attempt to make them easy.

slo-computer

Note

Around two years ago @last9 started advocating using Service Level Objectives. One of the biggest challenges was the lack of practical algorithms behind Burn Rate and alerting. This was our first attempt at it. If you would like us to release these algorithms, go ahead and help us reach 250 stars ⭐️.

SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy.

SLO, burn_rate, error_rate, budget_spend are convoluted terms that can throw one off. Even the SRE workbook by Google can leave you with a lot of open questions.

The concept of SLOs and SLIs has existed for a long time now, but we continue to be amazed by how widely misunderstood this topic is (and how easy it can make your lives if used well).

We are building a sandbox for our DevOps and SRE community - SLO computer - a product that makes setting and monitoring SLOs for all your services intuitively seamless and blazingly fast.

Usage

usage: slo [<flags>] <command> [<args> ...]

Last9 SLO toolkit

Flags:
  --help     Show context-sensitive help (also try --help-long and --help-man).
  --version  Show application version.

Commands:
  help [<command>...]
    Show help.

  suggest --throughput=THROUGHPUT --slo=SLO --duration=DURATION
    suggest alerts based on service throughput and SLO duration

  cpu-suggest --instance=INSTANCE --utilization=UTILIZATION
    suggest alerts based on CPU utilization and Instance type

The goal of this command (also available as an importable library) is to take some "bare minimum" input and:

  • Determine whether this is a low-traffic service, in which case it makes little sense to use an SLO approach
  • Compute the actual alert values and conditions to set alerts on

Examples

Q: What alerts should I set for my service to achieve 99.9 % availability over 30 days?

✗ ./slo-computer suggest --throughput=4200 --slo=99.9 --duration=720

		Alert if error_rate > 0.002 for last [24h0m0s] and also last [2h0m0s]
		This alert will trigger once 6.67% of error budget is consumed,
		and leaves 360h0m0s before the SLO is defeated.


		Alert if error_rate > 0.010 for last [1h0m0s] and also last [5m0s]
		This alert will trigger once 1.39% of error budget is consumed,
		and leaves 72h0m0s before the SLO is defeated.

Q: What alerts should I set for my service with a throughput of 100 rpm to achieve 99.9 % availability over 7 days?

✗ ./slo-computer suggest --throughput=100 --slo=99.9 --duration=168
slo-computer: error:
	If this service reported 10.000 errors for a duration of 5m0s
	SLO (for the entire duration) will be defeated wihin 1h40m47s

	Probably
	- Use ONLY spike alert model, and not SLOs (easiest)
	- Reduce the MTTR for this service (toughest)
	- SLO is too aggressive and can be lowerd (business decision)
	- Combine multiple services into one single service (teamwide)
, try --help

Q: What alerts should I set for my burst CPU?

✗ ./slo-computer cpu-suggest --instance=t3a.xlarge --utilization=15

	Alert if 100.00 % consumption sustains for 10m0s AND recent 5m0s.
	At this rate, burst credits will deplete after 10h0m0s


	Alert if 80.00 % consumption sustains for 3h45m0s AND recent 55m0s.
	At this rate, burst credits will deplete after 15h0m0s

About Last9

This project is sponsored and maintained by Last9. Last9 builds reliability tools for SRE and DevOps.
