  • Stars: 158
  • Rank: 230,275 (Top 5 %)
  • Language: HTML
  • License: MIT License
  • Created almost 6 years ago
  • Updated 3 months ago


Repository Details

A role-playing game for incident management training

Wheel of Misfortune

Wheel of Misfortune is a game that aims to build confidence in on-call engineers via simulated outage scenarios. With the game, you practice debugging problems under stress, following the incident management protocol, and communicating effectively with other engineers on your team and in your organization. It is a great way to train new hires, interns, and seasoned engineers to become well-rounded on-call engineers.

The game is inspired by the Site Reliability Engineering book.

Demo website

Instructions

Terminology

  • Scenario: A past or fictional incident case.
  • Game Master: The host-coordinator of the session.
  • Volunteer: The trainee on-call engineer.

Feel free to fork the repository or download the stable release. Copy the general_incidents.json.sample file to general_incidents.json, inside the incidents/ directory, and insert your incident scenarios into it.

To run the game locally, navigate to the main directory of the downloaded project (e.g. wheel-of-misfortune-5.0/) and start a web server from there, for example python3 -m http.server (or python -m SimpleHTTPServer on Python 2). Then open http://localhost:8000 in your web browser.
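The setup steps above can also be scripted. The following is a minimal sketch, not part of the project: the helper names `prepare_incidents` and `serve` are hypothetical, and it assumes the sample file sits inside the incidents/ directory next to its target. It copies the sample into place only when no general_incidents.json exists yet, then serves the project directory with Python 3's standard `http.server` module.

```python
import functools
import http.server
import shutil
from pathlib import Path


def prepare_incidents(project_root: Path) -> Path:
    """Copy the sample incidents file into place unless one already exists.

    Assumption: the sample lives in <project_root>/incidents/.
    """
    incidents_dir = project_root / "incidents"
    sample = incidents_dir / "general_incidents.json.sample"
    target = incidents_dir / "general_incidents.json"
    if not target.exists():
        # Never overwrite scenarios that were already written by hand.
        shutil.copyfile(sample, target)
    return target


def serve(project_root: Path, port: int = 8000) -> None:
    """Serve the project directory over HTTP until interrupted (Ctrl+C)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(project_root)
    )
    with http.server.ThreadingHTTPServer(("localhost", port), handler) as httpd:
        print(f"Serving {project_root} at http://localhost:{port}")
        httpd.serve_forever()
```

Calling `prepare_incidents(Path("wheel-of-misfortune-5.0"))` and then `serve(Path("wheel-of-misfortune-5.0"))` would reproduce the manual steps; binding to localhost keeps the game reachable only from your own machine.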

Each incident entry in the general_incidents.json file has the following keys:

  • ID: the unique ID of the outage (you can just auto-increment).
  • title: the title of the incident.
  • scenario: the description of the incident. It is useful to include URLs from monitoring systems, dashboards, time-series databases and playbooks.
  • inkstory: the path to an Ink story file in JSON format.

You can also use general_incidents.jsonnet as an example, in case you want to generate your incident scenarios using Jsonnet.
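If you prefer generating the file from a script rather than Jsonnet, the keys listed above can be sketched in Python as follows. This is an illustration under stated assumptions: the top-level structure is assumed to be a JSON array of objects, the key spelling is copied from the list above (check the sample file for the exact casing), `inkstory` is treated as optional, and the incident content and URL are invented placeholders.

```python
import json

# Assumption: these keys are mandatory and "inkstory" is optional.
REQUIRED_KEYS = {"ID", "title", "scenario"}


def validate_incidents(incidents: list) -> list:
    """Return human-readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, incident in enumerate(incidents):
        missing = REQUIRED_KEYS - incident.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        if "ID" in incident:
            if incident["ID"] in seen_ids:
                problems.append(f"entry {index}: duplicate ID {incident['ID']!r}")
            seen_ids.add(incident["ID"])
    return problems


# Hypothetical scenario, for illustration only.
incidents = [
    {
        "ID": 1,
        "title": "Checkout latency spike",
        "scenario": "p99 latency tripled after a deploy. "
                    "Dashboard: https://grafana.example.com/d/checkout",
    },
]

if not validate_incidents(incidents):
    # Text that could be written to incidents/general_incidents.json.
    print(json.dumps(incidents, indent=2))
```

Validating before writing catches duplicate IDs early, which matters if you auto-increment by hand.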

Ink

Ink is a scripting language for writing interactive narrative stories. It enables us to write interactive incident response narratives for team or individual training. You can use Inky to write an interactive narrative for an incident and then export the story as JSON. Then, you can store the story file inside the incidents/ folder and associate it with an incident scenario using the inkstory key. You can read an example incident narrative here.
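A quick way to catch a broken association between a scenario and its story is to check every inkstory path before a session. The sketch below is a hypothetical helper, not part of the project; it assumes that inkstory paths are relative to the project's main directory and that incidents are loaded as a list of dictionaries.

```python
import json
from pathlib import Path


def check_ink_stories(incidents: list, project_root: Path) -> list:
    """Report inkstory paths that are missing or do not contain valid JSON."""
    problems = []
    for incident in incidents:
        story = incident.get("inkstory")
        if not story:
            continue  # scenarios without an Ink story are skipped
        story_path = project_root / story
        if not story_path.is_file():
            problems.append(f"{story}: file not found")
            continue
        try:
            # utf-8-sig tolerates an optional byte-order mark at the start.
            json.loads(story_path.read_text(encoding="utf-8-sig"))
        except json.JSONDecodeError as exc:
            problems.append(f"{story}: not valid JSON ({exc.msg})")
    return problems
```

The check only confirms that the file parses as JSON; it does not verify that the content is a well-formed Ink story.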

Role Playing

Game Master

  1. Choose a volunteer to be the primary on-call engineer in front of the group.
  2. Find a balance between the volunteer's experience and the incident's difficulty.
  3. Assist the volunteer by answering questions that may arise from each theoretical action or dashboard observation.
     • Engage with the rest of the team and ask for different ways to debug the problem following the volunteer's explanation.
     • Team members may be made available over time for assistance in various topics.
  4. At the end, hold a debrief on the lessons learned in the session.

Volunteer

  1. Spin the wheel and attempt to fix the theoretical outage scenario.
  2. Explain to the Game Master and the rest of the group what actions you would take (lookup queries, checks in dashboards, etc.) to find the root causes, and eventually solve the incident.
  3. Always keep an eye on the time, since it is a simulated incident response scenario and not a routine troubleshooting process. During a real incident, you might have an SLA or SLO breach and therefore you should take timing into account.
  4. Engage with the rest of the group. Keep them in the loop. Ask questions to different members depending on their expertise.

Most importantly, have fun!

You can read a comprehensive example of how to conduct the exercise in the Google SRE book.

Featured

The Wheel of Misfortune was established as a practice in the Open Practice Library, and this project was featured there.

