Simple place for people to provide examples of queries they've found useful.

Purpose

Prometheus is awesome, but the human mind doesn't work in PromQL. The intention of this repository is to become a simple place for people to provide examples of queries they've found useful. We encourage all to contribute so that this can become something valuable to the community.

Simple or complex, all input is welcome.

PromQL Examples

These examples are formatted as recording rules, but can be used as normal expressions.

Please ensure all examples are submitted in the same format; we'd like to keep this nice and easy to read and maintain. The examples may contain metric names and labels that aren't present on your system. If you're looking to re-use them, make sure to validate that the labels and metric names match your system.


Show Overall CPU usage for a server

- record: instance:node_cpu_utilization_percent:rate5m
  expr: 100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))

Summary: Often useful to newcomers to Prometheus looking to replicate common host CPU checks. This query provides an overall CPU usage percentage per instance. It takes the per-second rate of the idle CPU metric, averages it across all CPUs on each instance, and subtracts the result from 1 to give the share of time spent in every other state. Note that irate() uses only the last two samples within the 5 minute window, so the result reflects the most recent usage rather than a 5 minute average.
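
As a hedged variant of the rule above: node_exporter 0.16 and later renamed node_cpu to node_cpu_seconds_total, so on newer systems the metric name needs adjusting. The sketch below assumes that newer metric name and swaps irate() for rate(), which averages over the whole 5 minute window and tends to suit recording rules better. Validate both choices against your own setup.

```yaml
# Sketch only: assumes node_exporter 0.16 or later (node_cpu_seconds_total).
# rate() is used instead of irate() so the value is a true 5 minute average.
- record: instance:node_cpu_utilization_percent:rate5m
  expr: 100 * (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))
```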


Track HTTP error rates as a proportion of total traffic

- record: job_instance_method_path:demo_api_request_errors_50x_requests:rate5m
  expr: >
    rate(demo_api_request_duration_seconds_count{status="500",job="demo"}[5m]) * 50
      > on(job, instance, method, path)
    rate(demo_api_request_duration_seconds_count{status="200",job="demo"}[5m])

Summary: This query selects the 500-status rate for any job, instance, method, and path combinations for which the 200-status rate is not at least 50 times higher than the 500-status rate. The rate function has been used here as it's designed to be used with the counters in this query.

Link: Julius Volz - Tutorial
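
The rule above compares the 500-status rate against the 200-status rate. If you would rather record the error rate as a fraction of all requests, a minimal sketch might look like the following. It reuses the demo metric and labels from the example above; the record name is illustrative, and the denominator simply drops the status matcher so that every status code is counted.

```yaml
# Sketch only: 500-status requests as a fraction of all requests,
# per job, instance, method and path. The record name is illustrative.
- record: job_instance_method_path:demo_api_request_errors_500:ratio_rate5m
  expr: >
    sum by(job, instance, method, path) (rate(demo_api_request_duration_seconds_count{status="500",job="demo"}[5m]))
      /
    sum by(job, instance, method, path) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))
```

A result of 0.02 would mean that 2% of requests for that combination returned a 500 over the last 5 minutes.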


90th Percentile latency

- record: instance:demo_api_90th_over_50ms_and_requests_over_1:rate5m
  expr: >
    histogram_quantile(0.9, rate(demo_api_request_duration_seconds_bucket{job="demo"}[5m])) > 0.05
      and
    rate(demo_api_request_duration_seconds_count{job="demo"}[5m]) > 1

Summary: Select any HTTP endpoints that have a 90th percentile latency higher than 50ms (0.05s), but only for the dimensional combinations that receive more than one request per second. We use the histogram_quantile() function for the percentile calculation here. It calculates the 90th percentile latency for each sub-dimension. The and operator then filters the resulting bad latencies, retaining only those series that receive more than one request per second. histogram_quantile() is only suitable for use with a Histogram metric.

Link: Julius Volz - Tutorial
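
The expression above computes a quantile for every label combination. To get a single quantile per path across all instances, the bucket rates can be summed first, as long as the le label is kept in the aggregation because histogram_quantile() needs it. A minimal sketch, assuming the demo metric carries a path label as in the error-rate example, with an illustrative record name:

```yaml
# Sketch only: 90th percentile latency per path, aggregated across instances.
# The le label must survive the sum, otherwise histogram_quantile() cannot work.
- record: path:demo_api_request_duration_seconds:p90_rate5m
  expr: >
    histogram_quantile(0.9,
      sum by(le, path) (rate(demo_api_request_duration_seconds_bucket{job="demo"}[5m])))
```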


HTTP request rate, per second, an hour ago

- record: instance:api_http_requests_total:offset_1h_rate5m
  expr: rate(api_http_requests_total{status="500"}[5m] offset 1h)

Summary: The rate() function calculates the per-second average rate of time series in a range vector. Combined with the offset modifier, it gives the rate of HTTP requests for a timeframe in the past. This query calculates the per-second rate of HTTP requests with status 500 over a 5 minute window, as it was an hour ago. Suitable for usage on a counter metric.

Link: Tom Verelst - Ordina
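
One common use of offset is to compare the current rate with the rate from an earlier point in time. The sketch below divides the current 5 minute rate by the same rate an hour ago, reusing the metric from the example above; the record name is illustrative.

```yaml
# Sketch only: current 500-status rate relative to the rate one hour ago.
# Values above 1 mean the rate has increased over the last hour.
- record: instance:api_http_requests_total:rate5m_vs_offset_1h
  expr: >
    rate(api_http_requests_total{status="500"}[5m])
      /
    rate(api_http_requests_total{status="500"}[5m] offset 1h)
```

Series that had a zero rate an hour ago will not produce a finite value, so treat this as a starting point rather than a finished rule.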


Kubernetes Container Memory Usage

- record: kubernetes_pod_name:container_memory_usage_bytes:sum
  expr: sum by(kubernetes_pod_name) (container_memory_usage_bytes{kubernetes_namespace="kube-system"})

Summary: How much memory are the tools in the kube-system namespace using? This query breaks it down by Pod.

Link: Joe Bowers - CoreOS


Most expensive time series

- record: metric_name:metrics:top_ten_count
  expr: topk(10, count by (__name__)({__name__=~".+"}))

Summary: Which are your most expensive time series to store? When tuning Prometheus, this query can help you find your most expensive metrics by counting the number of series behind each metric name. Be cautious, this query is expensive to run.

Link: Brian Brazil - Robust Perception


Jobs with the most time series

- record: job:metrics:top_ten_count
  expr: topk(10, count by (job)({__name__=~".+"}))

Summary: Which of your jobs have the most time series? Be cautious, this query is expensive to run.

Link: Brian Brazil - Robust Perception


Which Alerts have been firing?

- record: alerts_fired:24h
  expr: sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by (alertname))

Summary: Which of your Alerts have been firing the most? Useful to track alert trends.


Alert Rules Examples

These are examples of rules you can use with Prometheus to fire an alert, which is usually sent on to the Prometheus Alertmanager. You can refer to the official documentation for more information.

- alert: <alert name>
  expr: <expression>
  for: <duration>
  labels:
    label_name: <label value>
  annotations:
    annotation_name: <annotation value>
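
In Prometheus 2.x, both recording and alerting rules live inside named groups in a rule file. The sketch below shows how two of the rules from this page might sit in such a file. The group name is illustrative and the alert annotations are trimmed for brevity.

```yaml
# Sketch of a Prometheus 2.x rule file; the group name is illustrative.
groups:
  - name: example-rules
    rules:
      # A recording rule precomputes an expression under a new metric name.
      - record: instance:node_cpu_utilization_percent:rate5m
        expr: 100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))
      # An alerting rule fires once its expression has held for the `for` duration.
      - alert: HostCPUUtilisation
        expr: 100 - (avg by(instance) (irate(node_cpu{mode="idle"}[5m])) * 100) > 70
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: CPU Utilisation Alert
```

A file like this can be checked before loading with promtool check rules followed by the file name.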

Disk Will Fill in 4 Hours

- alert: PredictiveHostDiskSpace
  expr: predict_linear(node_filesystem_free{mountpoint="/"}[4h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    description: 'Based on recent sampling, the disk is likely to fill on volume
      {{ $labels.mountpoint }} within the next 4 hours for instance: {{ $labels.instance_id
      }} tagged as: {{ $labels.instance_name_tag }}'
    summary: Predictive Disk Space Utilisation Alert

Summary: Asks Prometheus to predict whether the host's disks will fill within four hours, based upon the last four hours of sampled data. In this example, we are returning AWS EC2 specific labels to make the alert more readable.


Alert on High Memory Load

- expr: (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached) ) / sum(node_memory_MemTotal) * 100 > 85

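Summary: Trigger an alert if memory is almost full. This is done by subtracting the free, buffered and cached memory from the total memory and dividing the result by the total again to obtain a percentage. The > 85 comparison only returns a result when the resulting value is above 85. Note that sum() without a by clause aggregates across every instance that is scraped; add by (instance) to each sum to evaluate hosts individually.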

Link: Stefan Prodan - Blog
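
The example above only gives the expression. As a hedged sketch of how it might look as a complete, per-host alerting rule: the alert name, the for duration, the severity and the annotation text below are all assumptions chosen for illustration, and the sum() calls are dropped so that each instance is evaluated on its own.

```yaml
# Sketch only: alert name, duration, severity and annotations are illustrative.
# Without sum(), the metrics are matched per instance on their shared labels.
- alert: HostHighMemoryLoad
  expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    description: 'High memory load detected for instance {{ $labels.instance }},
      the load is currently: {{ $value }}%'
    summary: Memory Load Alert
```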


Alert on High CPU utilisation

- alert: HostCPUUtilisation
  expr: 100 - (avg by(instance) (irate(node_cpu{mode="idle"}[5m])) * 100) > 70
  for: 20m
  labels:
    severity: warning
  annotations:
    description: 'High CPU utilisation detected for instance {{ $labels.instance
      }}, the utilisation is currently: {{ $value }}%'
    summary: CPU Utilisation Alert

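Summary: Trigger an alert if a host's CPU stays over 70% utilised for 20 minutes or more. Because the expression aggregates by instance, only the instance label is available to the alert annotations.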


Alert if Prometheus is throttling

- alert: PrometheusIngestionThrottling
  expr: prometheus_local_storage_persistence_urgency_score > 0.95
  for: 1m
  labels:
    severity: warning
  annotations:
    description: Prometheus cannot persist chunks to disk fast enough. Its urgency
      value is {{$value}}.
    summary: Prometheus is (or borderline) throttling ingestion of metrics

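Summary: Trigger an alert if Prometheus begins to throttle its ingestion. If you see this, some TLC is required. Note that this metric comes from the Prometheus 1.x local storage engine and is not exposed by Prometheus 2.x.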

