SRE Roadmap
An opinionated roadmap to becoming an SRE (Concepts > Tools)
Distributed systems
- Concepts
- Fallacies of distributed computing
- Synchronous vs. asynchronous
- Event log vs. message queue
- Exactly-once delivery
- Different types of message failure
- Orchestration vs. choreography
- Causality
- CDN
- Hashing
- Consistent hashing
- Geohashing
- Perfect hashing
- Read-heavy vs. write-heavy impacts
- Federation
- Latency
- Latency, throughput, goodput
- Latency numbers every programmer should know
- How to prevent latency variability
- Tail latency
- How to reduce sharing
- Idempotency
- Load balancer
- Concepts
- Layer 4 vs. layer 7 load balancer
- Liveness vs. safety properties
- Microservices: pros and cons
- REST
- gRPC
- Service mesh
- Source of truth
- Stateful vs. stateless
- Total vs. partial order
- Why can't we rely on the system clock in distributed systems
- Vector clock
- Cache
- When to use a cache
- Cache-aside vs. read-through
- Eviction policy
- Refresh-ahead
- Write-through vs. write-back
- Distributed cache
- Performance cache vs. capacity cache
- Databases
- Different types of databases
- NoSQL vs. SQL databases
- Relational vs. document
- Column-oriented databases
- Graph databases
- Vector database
- Object-based storage
- ACID
- Partitioning
- Criteria
- Methods
- Replication vs. partition
- Hotspot
- CALM theorem
- CAP theorem
- PACELC theorem
- Cardinality
- Chain replication
- Consensus
- Concurrency control
- Consistency models
- Isolation levels
- Serializability
- Linearizability
- CRDT
- Indexes
- Trade-offs
- Primary vs. secondary indexes
- Denormalization
- View & materialized view
- Transaction
- Downsides of distributed transactions
- Strategies to handle rebalancing
- Leader election
- MVCC
- N+1 select problem
- Quorum
- Raft
- Read repair
- Single-leader, multi-leader, leaderless replication
- Split-brain
- 2PC
- 3PC
- WAL
- Write and read amplification
- Data structure
- Probabilistic data structures
- Bloom filter
- Count-min sketch
- HyperLogLog
- Storage
- LSM tree
- B-tree
- SSTable
Reliability
- Concepts
- Difference between availability, resiliency, robustness, fault-tolerance, and reliability
- Why is it wrong to target 100% availability
- Blast radius
- Failure domain
- Cascading failures
- Hard vs. soft dependencies
- Scalability
- Concepts
- Knee point
- Ceiling
- Number one source of outages
- Tail tolerance
- Toil
- Patterns/Anti-patterns
- Bulkhead pattern
- Circuit breaker
- Exponential backoff
- Jitter
- Graceful degradation
- Load shedding
- Retry amplification
- Backpressure
- Rate limiting
- Request hedging
- Practices
- Chaos engineering
Observability
- Concepts
- What's the difference between monitoring and observability
- Trace vs. metric vs. log
- Golden signals
- Observer effect
- Percentile
- Streetlight anti-method
- Time-series based monitoring lies
- USE method
- Main metrics for cache
- Why should we be careful about average performance metrics
- Alerting
- Alerting strategy
- Alert fatigue
- Characteristics of a good alert
- Slow vs. fast burn alert
Rollout
- Concepts
- Bake time
- Feature flag
- Feature freeze
- Rollout supervision
- Rollout types
- Blue-green rollout
- Canary rollout
- Progressive rollout
- Shadow rollout
SLI/SLO/SLA
- Concepts
- SLI vs. SLO vs. SLA
- Error budget
- SLO
- Difference between KPIs and SLOs
- Benefits of having alerts based on SLOs
- Why is exceeding an SLO not necessarily a good thing
- SLO for data (freshness, completeness, consistency, etc.)
- SLO for mobile
- SLO for services
Container
- Container
- Container orchestration
Linux
- Scripting
- Filesystem
- Memory
- Processes
- Resource utilization
- Network
Network
- ARP
- Bandwidth
- BGP
- CoDel
- CORS
- DNS
- Ping vs. heartbeat
- TCP
- TCP vs. UDP
- Congestion control
- Connection backlog
- Flow control
- Handshake
- HTTP
- HTTP/2
- Head of line blocking
- Health checks: passive vs. active
- Internet model
- NTP
- OSI model
- Routers
- Switches
- Network topologies
- What happens if you type google.com in your browser
Security
- Authentication
- Certificate
- Certificate authority
- Cipher
- Confidentiality
- Encryption
- TLS
- PKI
- Signature
Analysis
- Core analysis loop
- Correlation vs. causation
- First principles
- Five whys technique
- Incident management
- How to address an incident (assess, mitigate, resolve)
- Incident roles
- How to write a postmortem
- 3C principles (Coordinate, Communicate, maintain Control)
Other
- SRE role
- Version control
Soft skills
- Communication
- Writing
- Oral
- Presentation
- The XY problem
- Collaboration
- Problem solving
- Curiosity
- Navigating ambiguity
- Staying humble