SRE Roadmap

An opinionated roadmap for becoming an SRE (Concepts > Tools)

Distributed systems

Concepts
- Fallacies of distributed computing
- Synchronous vs. asynchronous
- Event log vs. message queue
- Exactly-once delivery
- Different types of message failure
- Orchestration vs. choreography
- Causality
- CDN
- Hashing
  - Consistent hashing
  - Geohashing
  - Perfect hashing
- Read-heavy vs. write-heavy impacts
- Federation
- Latency
  - Latency, throughput, goodput
  - Latency numbers every programmer should know
  - How to prevent latency variability
  - Tail latency
- How to reduce sharing
- Idempotency
- Load balancer
  - Concepts
  - Layer 4 vs. layer 7 load balancer
- Liveness vs. safety properties
- Microservices: pros and cons
- REST
- gRPC
- Service mesh
- Source of truth
- Stateful vs. stateless
- Total vs. partial order
- Why can't we rely on the system clock in distributed systems
- Vector clock
Cache
- When to use a cache
- Cache-aside vs. read-through
- Eviction policy
- Refresh-ahead
- Write-through vs. write-back
- Distributed cache
- Performance cache vs. capacity cache
Databases
- Different types of databases
  - NoSQL vs. SQL databases
  - Relational vs. document
  - Column-oriented databases
  - Graph databases
  - Vector database
  - Object-based storage
- ACID
- Partitioning
  - Criteria
  - Methods
  - Replication vs. partition
- Hotspot
- CALM theorem
- CAP theorem
- PACELC theorem
- Cardinality
- Chain replication
- Consensus
- Concurrency control
- Consistency models
- Isolation levels
- Serializability
- Linearizability
- CRDT
- Indexes
  - Tradeoff
  - Primary vs. secondary indexes
- Denormalization
- View & materialized view
- Transaction
- Distributed transactions downsides
- Strategies to handle rebalancing
- Leader election
- MVCC
- N+1 select problem
- Quorum
- Raft
- Read repair
- Single-leader, multi-leader, leaderless replication
- Split-brain
- 2PC
- 3PC
- WAL
- Write and read amplification
Data structure
- Probabilistic data structures
  - Bloom filter
  - Count-min sketch
  - HyperLogLog
- Storage
  - LSM tree
  - B-tree
  - SSTable
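
To make the probabilistic data structures above more concrete, here is a minimal Bloom filter sketch. The names `BloomFilter`, `num_bits`, `num_hashes`, and `might_contain` are illustrative assumptions, and deriving several hash positions by salting SHA-256 is a simplification chosen for readability, not a production design.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, possible false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the example readable (assumption: clarity over memory).
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added is always reported as possibly present, while an empty filter reports every item as absent. Larger `num_bits` values lower the false-positive rate at the cost of memory.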

Reliability

Concepts
- Difference between availability, resiliency, robustness, fault-tolerance, and reliability
- Why is it wrong to target 100% availability
- Blast radius
- Failure domain
- Cascading failures
- Hard vs. soft dependencies
- Scalability
  - Concepts
  - Knee point
  - Ceiling
- Number one source of outages
- Tail tolerance
- Toil
Patterns/Anti-patterns
- Bulkhead pattern
- Circuit breaker
- Exponential backoff
- Jitter
- Graceful degradation
- Load shedding
- Retry amplification
- Backpressure
- Rate limiting
- Request hedging
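
As an illustration of how exponential backoff and jitter combine, the sketch below computes retry delays using the "full jitter" variant: each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function name `backoff_delays` and its parameters are hypothetical, and the default values are arbitrary assumptions.

```python
import random


def backoff_delays(base: float = 0.1, cap: float = 10.0, attempts: int = 5, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential growth, bounded by cap so delays never grow without limit.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: spread clients across [0, ceiling] to avoid synchronized retries.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Randomizing the delay is what keeps many clients from retrying at the same instant after a shared failure, which is one way retry amplification can be reduced.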
Practices
- Chaos engineering

Observability

Concepts
- What's the difference between monitoring and observability
- Trace vs. metric vs. log
- Golden signals
- Observer effect
- Percentile
- Streetlight anti-method
- Time-series based monitoring lies
- USE method
- Main metrics for cache
- Why should we be careful about average performance metrics
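
A small sketch of why averages can mislead, assuming a made-up sample of 98 fast requests and 2 slow ones. The `percentile` helper is a hypothetical nearest-rank implementation written for this example, not a library API.

```python
import statistics


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Ceiling of p * n / 100 using integer arithmetic to avoid float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 requests at 10 ms, 2 at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With this sample, the mean is 29.8 ms, the median is 10 ms, and the 99th percentile is 1000 ms. The mean matches no request that actually occurred, while the percentiles show both the typical case and the tail.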
Alerting
- Alerting strategy
- Alert fatigue
- Characteristics of a good alert
- Slow vs. fast burn alert

Rollout

Concepts
- Bake time
- Feature flag
- Feature freeze
- Rollout supervision
Rollout types
- Blue-green rollout
- Canary rollout
- Progressive rollout
- Shadow rollout

SLI/SLO/SLA

Concepts
- SLI vs. SLO vs. SLA
- Error budget
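
The error budget follows directly from the SLO: it is the share of the window in which the service is allowed to be unreliable. The sketch below assumes a time-based availability SLO and a 30-day window; the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    window_minutes = window_days * 24 * 60
    # The budget is the complement of the SLO applied to the whole window.
    return (1.0 - slo) * window_minutes
```

For example, a 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget, while a 100% target leaves none.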
SLO
- Difference between KPIs and SLOs
- Benefits of having alerts based on SLOs
- Why is exceeding an SLO not necessarily a good thing
- SLO for data (freshness, completeness, consistency, etc.)
- SLO for mobile
- SLO for services

Container

Container
Container orchestration

Linux

Scripting
Filesystem
Memory
Processes
Resource utilization
Network

Network

ARP
Bandwidth
BGP
CoDel
CORS
DNS
Ping vs. heartbeat
TCP
- TCP vs. UDP
- Congestion control
- Connection backlog
- Flow control
- Handshake
HTTP
HTTP/2
Head of line blocking
Health checks: passive vs. active
Internet model
NTP
OSI model
Routers
Switch
Network topologies
What happens if you type google.com in your browser

Security

Authentication
Certificate
Certificate authority
Cipher
Confidentiality
Encryption
TLS
PKI
Signature

Analysis

Core analysis loop
Correlation vs. causation
First principles
Five whys technique
Incident management
- How to address an incident (assess, mitigate, resolve)
- Incident roles
- How to write a postmortem
- 3C principles (Coordinate, Communicate, maintain Control)

Other

SRE role
Version control

Soft skills

Communication
- Writing
- Oral
- Presentation
- The XY problem
Collaboration
Problem solving
Curiosity
Navigating ambiguity
Staying humble

teivah/sre-roadmap
