• This repository has been archived on 03/Jan/2023
  • Language
    JavaScript
  • License
    MIT License
  • Created over 10 years ago
  • Updated almost 7 years ago

Repository Details

On call alert classification and reporting

Opsweekly

What is Opsweekly?

Opsweekly is a weekly report tracker, an on call categorisation and reporting tool, a sleep tracker, a meeting organiser and a coffee maker all in one.

The goal of Opsweekly is both to organise your team in one central place and to help you understand and improve your on call rotations through a simple on call "survey" and the reporting that results from that tracking.

Alert classification is a complicated task, but with Opsweekly a few simple questions about each alert received can pay dividends in improving the on call experience for your engineers.

Features

  • Weekly Updates: Every member of your team can write a weekly status update using hints (e.g. Github commits, JIRA tickets) to inform the team what they've been working on, and then optionally email it out.
  • On-Call Alert Classification: Track, measure and improve your on call rotations by allowing your engineers to easily classify and document the alerts they received.
    • Make a simple assessment of each alert: whether action was taken, no action was taken, or the alert needs modification for follow up later
    • Free notes field to allow documentation of actions taken to refer back to later
    • Bulk classification for time saving
  • Sleep Tracking: If your engineers have popular life tracking devices such as Fitbit or Jawbone UP, they integrate with Opsweekly to provide even more insight into the effect on call is having on their lives.
    • Mean time to sleep (MTTS) and sleep time lost to notifications are calculated.
    • Easy to configure but gives valuable data that could lead to questions like "can this alert wait until morning as it keeps waking up our engineers?"
  • In depth reporting: As you start to build up data, Opsweekly starts to generate reports and graphs illustrating your on call rotations.
    • Examples include: action taken vs no action taken on alerts, what alerts wake people up the most, mean time to sleep, top notifying hosts/services, average alert volume per day, and how on call has improved (or not) over the last year
  • Personal reporting: As well as a summary for all rotations, users are able to gain insight into their own behaviours.
    • How have their on calls affected them?
    • How much sleep do they lose on average?
    • How does this compare to others?
    • Optionally, one can view a sleep retrospective that compares the last several on-call rotations' impact on sleep loss.
      • In phplib/config.php, define oncall_sleep_retrospective_count with a numeric value (such as 3). Users viewing their profile will then see how past weeks affected their sleep.
  • Meeting Mode: Make running a weekly meeting simple with all the data you need in one page, and a facility for people to take notes.
    • Meeting mode hides all UI displaying only information required for the meeting.
    • The on call report for the previous week is included, along with key stats and elements from the report
    • All weekly updates are displayed in case items need to be discussed
    • Set up a cron to remind people about the weekly meeting and provide the permalink to the meeting
  • Powerful Search: All data is searchable using a powerful search function. The default search mode is fuzzy, which will return results from all data stored in Opsweekly. However, you can get more specific:
    • Search previous on call alerts for a history of that alert, previous engineers' notes, how the alerts were classified (is this alert constantly "no action taken"?) and a time map showing its frequency over the past year.
    • Search Weekly Updates for full context on changes made previously
    • Search Meeting Notes for agenda items discussed in previous meetings
  • Fully timezone aware: Obviously it's important for users to be editing the alerts they receive in the timezone that they received them in. Each user can set their own timezone for the whole Opsweekly UI.
  • Fill in as you go/drafts: Both the Weekly report and the On-call reports can be updated multiple times during the week, so the user does not have to edit a hefty report at the end.

Screenshots

Please visit the screenshot README for a guided tour of how Opsweekly works and the reports it can generate!

Prerequisites

  • A webserver
  • PHP 5.4 (or higher), including the curl extension for PHP, the MySQL extension, and short_open_tag enabled
  • MySQL for data storage

Installation/configuration

  1. Download/clone the repo into an appropriate folder, either in your webserver's directory or symlinked to it.
  2. Create a configuration in your webserver for Opsweekly, if using it as a separate domain (e.g. VirtualHost)
  3. You must increase the PHP variable max_input_vars for submitting on-call reports. See Increasing max input vars
  4. Create a MySQL database for opsweekly, and optionally grant a new user access to it. E.g.:
    • mysql> create database opsweekly;
    • mysql> grant all on opsweekly.* to opsweekly_user@localhost IDENTIFIED BY 'my_password';
  5. Load the database schema into MySQL, e.g. mysql -u opsweekly_user opsweekly < opsweekly.sql
  6. Teach Opsweekly how to authenticate your users.
  7. Move phplib/config.php.example to phplib/config.php, edit with your favourite editor (more detail below)
  8. Load Opsweekly in your browser
  9. Reward yourself with a refreshing beverage.

Upgrading

We're careful to only allow changes that should be backwards compatible with previous versions of opsweekly, e.g. if a new configuration value is added, a sensible default is included, etc.

Having said that, sometimes database schema changes are required. The script upgrade_db.php will attempt to alter your tables for those schema changes; if it fails, you can copy and paste the SQL and run it manually. Running upgrade_db.php more than once will not break your database.

Committers/Maintainers: If you add a new database column, please add your schema change to upgrade_db.php so existing users can enjoy the features you add!

Providers/Plugins

Opsweekly uses the concept of "providers" for the various pieces of data it needs. These are like plugins and can vary from team to team.

The following providers are used:

  • providers/weekly/: These are known as weekly "hints" which are used to helpfully hint or remind people what they did in the last week when writing their reports.
    • Examples of weekly hint providers include Github (showing recent commit activity) and JIRA (showing tickets closed in that time period)
  • providers/oncall/: These are used to pull in notifications from somewhere for the on call engineer to document.
    • For example, a provider can query Logstash or Splunk if you use them to parse your Nagios logs, or pull in alerts sent to Pagerduty.
  • providers/sleep: These are used to query an external datasource to establish whether the on call engineer was asleep during the notifications he or she received.
    • Opsweekly has been tested with Jawbone UP and Fitbit sleep trackers with success

The theory behind the providers means that if Opsweekly is not pulling data from a service you're currently using, it should be trivial to write your own provider and plug it in. Generally providers have two sets of configuration: one global for your entire instance, and one per team (or per user, in the case of sleep).

For more information about how to configure the providers or to write your own, please see the documentation in each of the provider directories mentioned above.

Configuration

The config.php.example contains an example configuration to get you on your way. It's fairly well commented to explain the common options, but we'll go into more depth here:

Authenticating with Opsweekly

It's very important that Opsweekly knows who each of its users is, so the first step of using Opsweekly is to teach it how to identify people.

In config.php, there is an important function, getUsername. This function must return the username, for example, "ldenness". You can write whatever PHP you like here; perhaps your SSO passes an HTTP header, or sets a cookie you can read to get the username.

The config.php.example has a couple of examples, including one that uses the username from HTTP Basic Auth, which can be configured with Apache.
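
As a minimal sketch of what getUsername might look like, assuming your webserver enforces HTTP Basic Auth so that PHP exposes the authenticated user as $_SERVER['PHP_AUTH_USER']. The X-Username header in the fallback is a hypothetical name for a header set by an SSO proxy, and returning false when no user is found is an assumption of this sketch rather than documented Opsweekly behaviour.

```php
<?php
// Sketch of getUsername() for phplib/config.php.
// Assumes HTTP Basic Auth is enforced by the webserver, in which case PHP
// makes the authenticated username available as PHP_AUTH_USER.
function getUsername() {
    if (isset($_SERVER['PHP_AUTH_USER'])) {
        return $_SERVER['PHP_AUTH_USER'];
    }
    // Hypothetical alternative: an SSO proxy that passes the username in a
    // request header ("X-Username" would appear as HTTP_X_USERNAME).
    if (isset($_SERVER['HTTP_X_USERNAME'])) {
        return $_SERVER['HTTP_X_USERNAME'];
    }
    // No username could be determined (assumed convention for this sketch).
    return false;
}
```

If you rely on a header, make sure only your trusted proxy can set it, since anyone able to send that header could otherwise impersonate another user.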

Increasing max input vars

PHP has a default limit on the number of variables that can be input via form submission. Because compiling and submitting the on-call report is essentially just submitting a giant form, you must increase this value or your reports will be truncated!

Look for the configuration option max_input_vars in your PHP configuration (e.g. php.ini) or, if you have your own VirtualHost (e.g. in Apache), you can do something like: php_value max_input_vars 10000 to increase the limit.

We highly suggest increasing to 10000 for future proofing your on-call reports. There's no real downside to this if you're limiting it to Opsweekly. The limit is to try and protect against exploits by hash collisions (basically, someone DoS-ing forms on your site). But you should not run Opsweekly exposed on the internet anyway.
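
To confirm that the new limit is actually in effect, a small check along these lines may help. It is only a sketch: the function name maxInputVarsIsAtLeast is made up for illustration, and 10000 is simply the value suggested above.

```php
<?php
// Report whether the active max_input_vars setting meets a target value.
// maxInputVarsIsAtLeast() is a hypothetical helper, not part of Opsweekly.
function maxInputVarsIsAtLeast($target) {
    // ini_get() returns the setting as a string, or false if it is unknown.
    $current = ini_get('max_input_vars');
    return $current !== false && (int) $current >= $target;
}

if (!maxInputVarsIsAtLeast(10000)) {
    echo "max_input_vars is " . ini_get('max_input_vars')
        . "; large on-call reports may be truncated\n";
}
```

Run the check through the webserver rather than the command line, because the CLI often loads a different php.ini and would not see a php_value set in a VirtualHost.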

Teams configuration

Opsweekly has the ability to support many different teams using the same codebase, if required. Each team gets its own "copy" of the UI at a unique URL, and their data is stored in a separate database.

Even if you only intend to use one team, the $teams array contains most of the important configuration for Opsweekly.

The key of the array(s) in the $teams array is the FQDN that you will access Opsweekly via, e.g. opsweekly.mycompany.com.

Inside this array are many configuration options:

  • display_name: The "friendly" or display name for your team is used throughout the UI to describe your team. For example, "Ops"
  • root_url: If your installation is on a path other than "/", enter the path here. For example, if your desired URL is "http://intranet.mycompany.com/opsweekly", you would enter "/opsweekly".
  • email_report_to: The email address of the mailing list your team uses to communicate, used for sending weekly reports (if the person requests it) or any other email communication.
  • database: The name of the MySQL database Opsweekly will try and use for this team
  • oncall: Either false or another array containing configuration regarding your on call rotations.
    • If you wish for this team's on call data to be tracked, this should be an array containing the following information:
    • provider: Which on call provider you wish to use for this team to fetch information, for example "splunk", "logstash" or "pagerduty"
    • provider_options: An array of team-specific configuration options that this plugin requires. The list of these is available in the documentation for the provider itself. For example, Pagerduty will require the service ID.
    • timezone: The PHP style timezone that this team operates in, or rather the timezone that your on call rotation starts in. A great example here is to take this (and the following two variables) directly from Pagerduty if you use that for scheduling your on call rotations
    • start: The time when your on call rotation starts. This is input into strtotime so it can be friendly text like "friday 18:00" for 6pm on Friday
    • end: As above, except when your on call rotation ends.
  • weekly_hints: The weekly hint providers you wish to use for this team to prompt people to fill in their weekly reports. There are examples in the providers/weekly folder, for example Github (pulling in recent commits) and JIRA (pulling in closed tickets)
  • irc_channel: The IRC channel your team uses. Used for various IRC integrations (currently just warning about weekly meeting time, if cron is set up)

You can have as many teams as you want in the $teams array, they just need to have unique FQDNs.
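
Putting the options above together, a single-team entry might look roughly like the following. This is a sketch with placeholder values: the option names come from the list above, but the service_id key inside provider_options is hypothetical, and treating weekly_hints as a list of provider names is an assumption. Check config.php.example and the provider documentation for the exact shapes.

```php
<?php
// Sketch of a single-team $teams entry for phplib/config.php.
// All values are placeholders.
$teams = array(
    // Key: the FQDN you use to reach Opsweekly.
    "intranet.mycompany.com" => array(
        "display_name" => "Ops",
        // Only needed when Opsweekly lives on a path other than "/".
        "root_url" => "/opsweekly",
        "email_report_to" => "ops@mycompany.com",
        "database" => "opsweekly",
        // Set "oncall" to false if this team's on call data is not tracked.
        "oncall" => array(
            "provider" => "pagerduty",
            "provider_options" => array(
                // Hypothetical option name; see the provider's documentation.
                "service_id" => "CHANGEME",
            ),
            "timezone" => "America/New_York",
            // Passed to strtotime(), so friendly text is accepted.
            "start" => "friday 18:00",
            "end" => "friday 18:00",
        ),
        // Assumed to be a list of keys from $weekly_providers.
        "weekly_hints" => array("github", "jira"),
        "irc_channel" => "#ops",
    ),
);
```

Additional teams would be further entries in the same array, each keyed by its own FQDN.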

Weekly "hint" provider configuration

In this section you define and configure the available weekly hint providers. These are displayed on the right hand side of the "Add" page, so people have some information in front of them about what they did as a prompt for writing their updates.

Of course, you are free to write your own that suits your needs. If you wish to do so, please see the documentation inside of the providers/weekly folder.

The $weekly_providers array handles the definition and configuring of the plugins in the providers/weekly folder. The array key should be a simple name of your provider, e.g. "github". This name is referred to in the teams configuration under weekly_hints. Then as values inside the array, the following are required:

  • display_name: Displayed above the output from your plugin, this is the friendly header name for your provider, e.g. "Last week's tickets"
  • lib: The path to the PHP file that contains your provider, e.g. providers/weekly/github.php
  • class: The class name you're using for your weekly provider, which will be created if requested by the team configuration
  • options: An array of arbitrary key/value pairs that are passed into the provider when it's loaded, used for configuration that is to be shared between all teams. For example, a path to an API, or a username and password to log in to an API.
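
As an illustration of the four required values, a Github entry could be sketched as below. The class name and the keys inside options are hypothetical placeholders; use the class your provider file actually defines and the option names listed in the provider's documentation.

```php
<?php
// Sketch of a $weekly_providers entry for phplib/config.php.
$weekly_providers = array(
    // Key: the simple name referenced from a team's weekly_hints.
    "github" => array(
        "display_name" => "Last week's commits",
        "lib" => "providers/weekly/github.php",
        // Hypothetical class name; match the class in the file above.
        "class" => "GithubHints",
        "options" => array(
            // Hypothetical option names, shared by all teams.
            "api_url" => "https://github.mycompany.com/api/v3",
            "api_token" => "changeme",
        ),
    ),
);
```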

On call provider configuration

In this section you define and configure the available on call notification providers. On call providers are plugins that, given a time period and a username (plus the configuration entered both here and in the team configuration), fetch all the notifications the person received in that period so they can classify the alerts.

Of course, you are free to write your own that suits your needs. If you wish to do so, please see the documentation inside of the providers/oncall folder.

The $oncall_providers array handles the definition and configuring of the plugins in the providers/oncall folder. The array key should be a simple name of your provider, e.g. "pagerduty". This name is referred to inside the teams configuration in the on call section as provider. Then as values inside the array, the following are required:

  • display_name: A friendly, display name for your provider (e.g. Pagerduty)
  • lib: The path to the PHP file that contains your provider code, e.g. providers/oncall/pagerduty.php
  • options: An array of arbitrary key/value pairs that are passed into the provider when it's loaded, used for configuration that is to be shared between all teams. For example, a path to an API, or a username and password to log in to an API.
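
A Pagerduty entry might then be sketched like this. The keys inside options are hypothetical; the real names are listed in the documentation in the providers/oncall folder.

```php
<?php
// Sketch of an $oncall_providers entry for phplib/config.php.
$oncall_providers = array(
    // Key: the simple name referenced as "provider" in a team's oncall config.
    "pagerduty" => array(
        "display_name" => "Pagerduty",
        "lib" => "providers/oncall/pagerduty.php",
        "options" => array(
            // Hypothetical option names, shared by all teams.
            "base_url" => "https://mycompany.pagerduty.com/api/v1",
            "apikey" => "changeme",
        ),
    ),
);
```

Team-specific values, such as the Pagerduty service ID, belong in that team's provider_options rather than here.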

Sleep provider configuration

In this section you can define and configure the sleep providers that users can choose in their "Edit Profile" screen. Sleep providers are plugins that, given a Unix timestamp, return data on the sleep state of the user (for example, whether they were asleep, how deeply, and whether and how quickly they got back to sleep).

We use this data to generate interesting reports about how on call rotations are affecting engineers' sleep patterns, and to help the team try to improve this required practice. For example, by listing the alerts that most often woke engineers, you could make a conscious decision to wait to send an alert until morning, if it's not urgent enough.

The data is only stored alongside the notifications in the MySQL database, never shared.

Of course, you are free to write your own that suits your needs. If you wish to do so, please see the documentation inside of the providers/sleep folder.

The $sleep_providers array handles the definition and configuring of the plugins in the providers/sleep folder. The array key should be a simple name of your provider. The values must include the following:

  • display_name: A friendly name to display on the UI of Opsweekly for this provider. E.g. "Jawbone UP"
  • description: A description of the sleep tracker, to differentiate it from others
  • logo: Please place a logo in an addressable location, e.g. in the /assets/sleep/ directory (30x30px) and place the URL path to it here.
  • options: An array of key/value pairs that will be used to display configuration options in the UI to users. Unlike other providers, sleep tracking is a per user subject, so configuration is entered via the "Edit Profile" screen, and stored in the database. Each option is parsed to create a HTML form input field. The key should be the option name. The following values are required:
    • type: The type of input field. Currently only text is supported/tested.
    • name: The friendly "field name" for the input box
    • description: The description of what the user should enter, displayed next to the input box
    • placeholder: Placeholder text displayed inside of the text box
  • lib: The path to the PHP file that contains your provider code, e.g. providers/sleep/up.php
  • You are also allowed to pass in any other arbitrary key/value pairs. As the entire config array is passed to the plugin, you can retrieve any values that apply to Opsweekly as a whole, rather than per user (which are specified above)
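
The nesting of per-user options inside a provider entry is easier to see in an example. The following is a sketch only: the provider key, logo filename, the user_token option and the api_timeout value are all hypothetical, chosen to show where each kind of setting goes.

```php
<?php
// Sketch of a $sleep_providers entry for phplib/config.php.
$sleep_providers = array(
    // Key: a simple name for the provider (hypothetical here).
    "up" => array(
        "display_name" => "Jawbone UP",
        "description" => "Jawbone UP wristband sleep tracker",
        // URL path to a 30x30px logo; the filename is a placeholder.
        "logo" => "/assets/sleep/up.png",
        // Per-user options: each one becomes a form field on "Edit Profile".
        "options" => array(
            // Hypothetical option name.
            "user_token" => array(
                "type" => "text",
                "name" => "Access token",
                "description" => "The access token for your sleep tracker account",
                "placeholder" => "Paste your token here",
            ),
        ),
        "lib" => "providers/sleep/up.php",
        // Hypothetical instance-wide value, readable by the plugin because
        // the whole config array is passed to it.
        "api_timeout" => 10,
    ),
);
```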

Generic configuration

There are a few other configuration options, which are documented in the example config file. Some highlights include:

  • $mysql_host, $mysql_user, $mysql_pass: Global configuration for your MySQL database. Per team database configuration (e.g. the database name to use) goes inside the team config.
  • $email_from_domain: The domain name you use to send email, used for a "From" address when sending weekly reports.
  • $search_results_per_page: Allows control of the number of search results returned at once
  • $error_log_file: Opsweekly prints some events, especially those relating to on call fetching and sleep tracking, to a debug log file. This log file can be extremely useful for debugging provider issues.
  • $dev_fqdn, $prod_fqdn: To allow ease of development, Opsweekly will preg_replace the hostname given to it to another hostname (which then matches your team names in the $teams array).
  • $irccat_hostname, $irccat_port: If you use irccat and wish to use meeting reminders, and have them appear in IRC, you will need to configure the hostname and port your irccat instance runs at here.
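
For orientation, the variables above might be set along these lines. Every value is a placeholder to replace with your own. $dev_fqdn and $prod_fqdn are left out because the pattern format they expect is best copied from config.php.example.

```php
<?php
// Sketch of the generic settings in phplib/config.php (placeholder values).

// Global MySQL connection details; the database name is set per team.
$mysql_host = "localhost";
$mysql_user = "opsweekly_user";
$mysql_pass = "my_password";

// Domain used to build the "From" address on weekly report emails.
$email_from_domain = "mycompany.com";

// Number of search results returned at once.
$search_results_per_page = 25;

// Debug log for on call fetching and sleep tracking events.
$error_log_file = "/var/log/opsweekly_debug.log";

// Only needed if you use irccat for meeting reminders in IRC.
$irccat_hostname = "irccat.mycompany.com";
$irccat_port = 12345;
```

The webserver user needs write access to whatever path you choose for $error_log_file.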

A note on on-call classification and categorisation

One of Opsweekly's core goals is to try and assist with thinking deeply about on call rotations and the notifications received during them.

A big part of this is requiring the on call engineer to categorise every alert they receive. If they receive 50+, this can be a daunting task.

We spent a long time trying to come up with a good balance of concise options to choose from, that provided a sufficient amount of detail but at the same time didn't overwhelm the user.

This is the list that we came up with:


Two types: Action/No Action

The main thing we wanted to record was whether an alert was actionable or not, i.e. whether there was a genuine problem affecting the service that the user had to intervene to fix.

Therefore, the alert categorisations are broken down into those two categories.

Action Taken Tags

The following are "Action Taken" tags, and their brief description:

  • Service Issue (View Clean): The service was affected, and this alert correctly saw the issue and alerted on it
  • Service Issue (timeperiod inappropriate, view clean): There was an issue but it could've alerted during another timeperiod, e.g. it could've waited until morning.
  • View Issue (network/site outage, service health questionable): The service could've been okay or broken; the monitoring system did not have a clear enough view of the system to say either way, but an alert was fired.
    • Most common use for this is a cascading failure/alert storm due to network outage with no parents setup

No Action Taken Tags

The following are "No Action Taken" tags, and their brief description:

  • Time period inappropriate: This alert should go off during a different time period, e.g. during the waking hours
  • Work ongoing, downtime expired: Known work is occurring on this system, but the downtime/alert suppression expired
  • Work ongoing, downtime not set: Known work is occurring on this system, but downtime/alert suppression was not set at all (e.g. by accident)
  • Threshold adjustment required: An alert fired due to a misconfiguration of thresholds. The threshold should be adjusted.
  • Check is faulty/requires modification: An alert fired because the check malfunctioned, needs to be redesigned or otherwise requires modification

More than once engineers have been somewhat baffled by these choices and asked for another option, but in all those cases it ended up being covered by an existing choice. This is actually a good thing, as it forces everyone to think about the initial cause of the alert, rather than things wrapped up around it.

Hopefully by using Opsweekly you can become very aware of the kind of alerts your engineers are receiving, and then work to reduce noise and only wake/context switch your intelligent humans to do jobs when they're really required to do so.

Setting up meeting reminders

You can have opsweekly automatically email and IRC message you to remind you about meeting time, and provide the permalink to this week's meeting for convenience.

To do so, simply set up a cron (or other method of triggering script e.g. manually) with the following: php /path/to/opsweekly/send_meeting_reminder.php <your-configured-cname>

For example, using cron to run weekly on Wednesdays at 2pm: 0 14 * * 3 php /var/www/opsweekly/send_meeting_reminder.php myweekly.yourdomain.com

Known issues/caveats/future goals

  • As the name implies, Opsweekly is rather tied to the concept of a week. In theory the database stores time ranges, but the UI is all based on a week's worth of data
    • At some point I envisage dropping the concept of a fixed time period and instead having "providers" that pull the periods people were on call, prompting them to fill in the data. E.g. Pagerduty: You were on call from X to Y, please categorise your alerts for that period.
  • Whilst users can fill in their weekly report and on call report as the week continues, two people cannot edit the same on call report, otherwise duplicate events may appear in the reports. This is due to the timestamp being part of the unique key for alerts.
  • Garbage in, garbage out: There is no way to exclude items returned by the on call providers in the reports right now, other than deleting anything you're not happy with from the database after compiling the on-call report. I want to add a "soft delete" to deliberately allow editing of reports for unforeseen reasons (e.g. the monitoring system goes wrong and doesn't actually send any alerts, but they were logged)
