  • Language: Go
  • License: MIT License
  • Created over 7 years ago
  • Updated 3 months ago

Rule based pod killing kubernetes controller

pod-reaper: kills pods dead

A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions. See the "Implemented Rules" section below for details on specific rules.

Configuring Pod Reaper

Pod-Reaper is configurable through environment variables. The pod-reaper specific environment variables are:

  • NAMESPACE the kubernetes namespace where pod-reaper should look for pods
  • GRACE_PERIOD duration that pods should be given to shut down before hard killing the pod
  • SCHEDULE schedule for when pod-reaper should look for pods to reap
  • RUN_DURATION how long pod-reaper should run before exiting
  • EVICT try to evict pods instead of deleting them
  • EXCLUDE_LABEL_KEY pod metadata label (of key-value pair) that pod-reaper should exclude
  • EXCLUDE_LABEL_VALUES comma-separated list of metadata label values (of key-value pair) that pod-reaper should exclude
  • REQUIRE_LABEL_KEY pod metadata label (of key-value pair) that pod-reaper should require
  • REQUIRE_LABEL_VALUES comma-separated list of metadata label values (of key-value pair) that pod-reaper should require
  • REQUIRE_ANNOTATION_KEY pod metadata annotation (of key-value pair) that pod-reaper should require
  • REQUIRE_ANNOTATION_VALUES comma-separated list of metadata annotation values (of key-value pair) that pod-reaper should require
  • DRY_RUN log pod-reaper's actions but don't actually kill any pods
  • MAX_PODS kill a maximum number of pods on each run
  • POD_SORTING_STRATEGY sorts pods before killing them (most useful when used with MAX_PODS)
  • LOG_LEVEL control verbosity level of log messages
  • LOG_FORMAT choose between several formats of logging

Additionally, at least one rule must be enabled, or the pod-reaper will error and exit. See the Rules section below for configuring and enabling rules.

Example environment variables:

# pod-reaper configuration
NAMESPACE=test
SCHEDULE=@every 30s
RUN_DURATION=15m
EXCLUDE_LABEL_KEY=pod-reaper
EXCLUDE_LABEL_VALUES=disabled,false

# enable at least one rule
CHAOS_CHANCE=.001

NAMESPACE

Default value: "" (which will look at ALL namespaces)

Controls which kubernetes namespace is in scope for the pod-reaper. Note that the pod-reaper uses an InClusterConfig, which makes use of the service account that kubernetes gives to its pods. Only pods (and namespaces) accessible to this service account will be visible to the pod-reaper.

GRACE_PERIOD

Default value: nil (indicates that the default grace period specified for each pod should be used)

Controls the grace period between a soft pod termination and a hard termination. This determines the time between when the pod's containers are sent a SIGTERM signal and when they are sent a SIGKILL signal. The format follows Go's time.Duration format (example: "1h15m30s"). A duration of 0s can be considered a hard kill of the pod.
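
As a rough illustration of how a GRACE_PERIOD value could be turned into a number of seconds, here is a minimal sketch using only the Go standard library. The function name gracePeriodSeconds is hypothetical and is not taken from pod-reaper's source. It assumes an empty value means "unset", and it returns a pointer because the kubernetes delete options carry the grace period as an optional number of seconds.

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriodSeconds converts a GRACE_PERIOD-style string into whole seconds.
// A nil result stands for "unset", meaning the pod's own default applies.
// (Hypothetical helper for illustration only.)
func gracePeriodSeconds(value string) (*int64, error) {
	if value == "" {
		return nil, nil
	}
	d, err := time.ParseDuration(value)
	if err != nil {
		return nil, fmt.Errorf("invalid grace period %q: %w", value, err)
	}
	seconds := int64(d.Seconds())
	return &seconds, nil
}

func main() {
	for _, v := range []string{"", "0s", "1h15m30s", "soon"} {
		s, err := gracePeriodSeconds(v)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case s == nil:
			fmt.Printf("%q -> use the pod's default\n", v)
		default:
			fmt.Printf("%q -> %d seconds\n", v, *s)
		}
	}
}
```

The same time.ParseDuration call applies to the other duration-valued settings described in this document.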

SCHEDULE

Default value: "@every 1m"

Controls how frequently pod-reaper queries kubernetes for pods. The format follows the upstream cron library https://godoc.org/github.com/robfig/cron. For most use cases, the interval format @every 1h2m3s is sufficient, but more complex use cases can make use of the * * * * * notation. The cron parser used can optionally support seconds if a sixth parameter is added. For example, 12 * * * * * will run on the 12th second of every minute.

RUN_DURATION

Default value: "0s" (which corresponds to running indefinitely)

Controls the minimum duration that pod-reaper will run before intentionally exiting. The value of "0s" (or anything equivalent such as the empty string) will be interpreted as an indefinite run duration. The format follows Go's time.Duration format (example: "1h15m30s"). Pod-Reaper will not wait for in-progress reap cycles to finish and will exit immediately (with exit code 0) after the duration has elapsed.

Warnings about RUN_DURATION

  • pod-rescheduling: if the reaper completes, even successfully, it may be restarted depending on the pod-spec.
  • self-reaping: the pod-reaper can reap itself if configured to do so, which can cause the reaper to not run for the expected duration.

Recommendations:

One time run:

  • create a pod spec and apply it to kubernetes
  • make sure the pod spec has restartPolicy: Never
  • add an exclusion label and key using EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES
  • make the pod spec for the reaper match an excluded label and key to prevent it from reaping itself

Sustained running:

  • do not use RUN_DURATION
  • manage the pod reaper via a deployment

EVICT

Use the Eviction API instead of pod deletion when reaping pods. The Eviction API will honor the disruption budget assigned to pods, and can for example be useful when reaping pods by duration to ensure that you don't reap all the pods of a specific deployment simultaneously, interrupting a published service. When a pod cannot be reaped due to a disruption budget, the reason will be logged as a warning.

EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES

These environment variables are used to build a label selector to exclude pods from reaping. The key must be a properly formed kubernetes label key. Values are a comma-separated (without whitespace) list of kubernetes label values. Setting exactly one of the key or values environment variables will result in an error.

A pod will be excluded from the pod-reaper if the pod has a metadata label whose key matches the pod-reaper's exclude label key and whose value is in the pod-reaper's list of excluded label values. This means that exclusion requires both the pod-reaper and the pod to be configured in a compatible way.
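
The exclusion check described above can be sketched in plain Go. This is only an illustration of the matching logic, not pod-reaper's implementation, which builds a kubernetes label selector. The function name isExcluded and the map-based pod labels are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// isExcluded reports whether a pod's labels match the exclusion settings.
// excludeValues is a comma-separated list without whitespace, as in
// EXCLUDE_LABEL_VALUES. (Hypothetical helper for illustration only.)
func isExcluded(podLabels map[string]string, excludeKey, excludeValues string) bool {
	value, ok := podLabels[excludeKey]
	if !ok {
		return false // the pod does not carry the exclusion label at all
	}
	for _, candidate := range strings.Split(excludeValues, ",") {
		if value == candidate {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors EXCLUDE_LABEL_KEY=pod-reaper and EXCLUDE_LABEL_VALUES=disabled,false.
	pods := map[string]map[string]string{
		"protected":   {"app": "web", "pod-reaper": "disabled"},
		"opted-in":    {"app": "web", "pod-reaper": "enabled"},
		"not-labeled": {"app": "web"},
	}
	for name, labels := range pods {
		fmt.Println(name, isExcluded(labels, "pod-reaper", "disabled,false"))
	}
}
```

Only the pod whose label has both the configured key and one of the configured values is excluded; a pod with the key but a different value is still eligible for reaping.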

REQUIRE_LABEL_KEY and REQUIRE_LABEL_VALUES

These environment variables build a label selector that pods must match in order to be reaped. Use them the same way as you would EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES.

REQUIRE_ANNOTATION_KEY and REQUIRE_ANNOTATION_VALUES

These environment variables build an annotation selector that pods must match in order to be reaped. Use them the same way as you would EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES.

DRY_RUN

Default value: unset (which will behave as if it were set to "false")

Acceptable values are 1, t, T, TRUE, true, True, 0, f, F, FALSE, false, False. Any other value will cause an error. If the provided value is one of the "true" values, pod-reaper will select pods for reaping but will not actually kill any pods. Log messages will reflect that a pod was selected for reaping and that the pod was not killed because the reaper is in dry-run mode.
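
The accepted values listed above are exactly the set that Go's strconv.ParseBool understands, so a setting like this can be parsed with the standard library. The sketch below assumes that is how the value is read; the function name parseDryRun is hypothetical, and treating an empty value as false follows the documented default.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseDryRun interprets a DRY_RUN-style value.
// An empty (unset) value behaves like "false"; anything that
// strconv.ParseBool rejects is returned as an error.
// (Hypothetical helper for illustration only.)
func parseDryRun(value string) (bool, error) {
	if value == "" {
		return false, nil
	}
	return strconv.ParseBool(value)
}

func main() {
	for _, v := range []string{"", "1", "T", "true", "False", "yes"} {
		dryRun, err := parseDryRun(v)
		if err != nil {
			fmt.Printf("%q -> error: %v\n", v, err)
			continue
		}
		fmt.Printf("%q -> dry run: %t\n", v, dryRun)
	}
}
```

Note that values such as "yes" or "on" are rejected rather than treated as true.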

MAX_PODS

Default value: unset (which will behave as if it were set to "0")

Acceptable values are positive integers. Negative integers evaluate to 0, and any other value will cause an error. Setting this can be useful to prevent too many pods from being killed in one run. Log messages will reflect that a pod was selected for reaping and that the pod was not killed because too many pods had already been reaped.

POD_SORTING_STRATEGY

Default value: unset (which will use the pod ordering returned by the API server when no ordering is specified). Accepted values:

  • (unset) - use the default ordering from the API server
  • random (case-sensitive) will randomly shuffle the list of pods before killing
  • oldest-first (case-sensitive) will sort pods oldest-first based on the pod's start time (!! warning below)
  • youngest-first (case-sensitive) will sort pods youngest-first based on the pod's start time (!! warning below)
  • pod-deletion-cost (case-sensitive) will sort pods based on the pod deletion cost annotation.

!! WARNINGS !!

Pod start time is not always defined. In these cases, sorting strategies based on age put pods without start times at the end of the list. From my experience, this usually happens during a race condition with the pod initially being scheduled, but there may be other cases hidden away.

Using pod-reaper against the kube-system namespace can have some surprising implications. For example, during testing I found that the kube-scheduler was owned by a master node (not a replicaset/daemon-set) and appeared to effectively ignore delete actions. The age returned from kubectl was reset, but the actual pod start time was unaffected. As a result, I found a looping scenario where the kube-scheduler was effectively always the oldest pod.

In examples/pod-sorting-strategy.yml I mitigated this by excluding pods with the label tier: control-plane.

Logging

Pod reaper logs in JSON format using logrus (https://github.com/sirupsen/logrus).

  • rule load: custom messages for each rule are logged when the pod-reaper is starting
  • reap cycle: a message is logged each time the reaper starts a cycle
  • pod reap: a message is logged (with a reason for each rule) when a pod is flagged for reaping
  • exit: a message is logged when the reaper exits successfully (only if RUN_DURATION is specified)

LOG_LEVEL

Default value: Info

Messages at this level and above will be logged. Available logging levels: Debug, Info, Warning, Error, Fatal, and Panic

Example Log

{"level":"info","msg":"loaded rule: chaos chance .3","time":"2017-10-18T17:09:25Z"}
{"level":"info","msg":"loaded rule: maximum run duration 2m","time":"2017-10-18T17:09:25Z"}
{"level":"info","msg":"executing reap cycle","time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"reaping pod","pod":"hello-cloud-deployment-3026746346-bj65k","reasons":["was flagged for chaos","has been running for 3m6.257891269s"],"time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"reaping pod","pod":"example-pod-deployment-125971999cgsws","reasons":["was flagged for chaos","has been running for 2m55.269615797s"],"time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"executing reap cycle","time":"2017-10-18T17:10:25Z"}
{"level":"info","msg":"reaping pod","pod":"hello-cloud-deployment-3026746346-grw12","reasons":["was flagged for chaos","has been running for 3m36.054164005s"],"time":"2017-10-18T17:10:25Z"}
{"level":"info","msg":"pod reaper is exiting","time":"2017-10-18T17:10:46Z"}

LOG_FORMAT

Default value: Logrus

This environment variable modifies the structured log format for easy ingestion into different logging systems, including Stackdriver via the Fluentd format. Available formats: Logrus, Fluentd

Implemented Rules

CHAOS_CHANCE

Flags a pod for reaping based on a random number generator.

Enabled and configured by setting the environment variable CHAOS_CHANCE to a floating point value. A random number generator will generate a value in the range [0,1), and if the generated value is below the configured chaos chance, the pod will be flagged for reaping.

Example:

# every 30 seconds kill 1/100 pods found (based on random chance)
SCHEDULE=@every 30s
CHAOS_CHANCE=.01

Remember that pods can be excluded from reaping if the pod has a label matching the pod-reaper's configuration. See the EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES section above for more details.
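
The chaos check described in this section comes down to a single comparison. The sketch below shows it with the random source passed in as a function so the behavior can be checked deterministically. The function name flaggedForChaos is hypothetical, and math/rand is an assumption about the random source rather than something this document states.

```go
package main

import (
	"fmt"
	"math/rand"
)

// flaggedForChaos draws a value in [0,1) from random and flags the pod
// when that value is below chance. (Hypothetical helper for illustration.)
func flaggedForChaos(chance float64, random func() float64) bool {
	return random() < chance
}

func main() {
	const pods = 10000
	flagged := 0
	for i := 0; i < pods; i++ {
		if flaggedForChaos(0.01, rand.Float64) {
			flagged++
		}
	}
	// With CHAOS_CHANCE=.01, roughly 1 in 100 pods is flagged per cycle.
	fmt.Printf("flagged %d of %d pods\n", flagged, pods)
}
```

Because the draw is in [0,1), a chance of 0 never flags a pod and a chance of 1 always does.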

CONTAINER_STATUSES

Flags a pod for reaping based on a container within a pod having a specific container status.

Enabled and configured by setting the environment variable CONTAINER_STATUSES to a comma-separated list (no whitespace) of statuses. If a container in a pod is in either a waiting or terminated state with a status in the specified list, the pod will be flagged for reaping.

Example:

# every 10 minutes, kill all pods with a container with a status ImagePullBackOff, ErrImagePull, or Error
SCHEDULE=@every 10m
CONTAINER_STATUSES=ImagePullBackOff,ErrImagePull,Error

Note that this will not catch statuses that are describing the entire pod like the Evicted status.
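
As a rough sketch of the container status check, the example below looks at each container's waiting or terminated status and flags the pod if any of them appears in the configured list. The containerState struct and the function name flaggedByContainerStatus are hypothetical simplifications for illustration and do not mirror the kubernetes API types.

```go
package main

import (
	"fmt"
	"strings"
)

// containerState is a hypothetical, simplified view of one container.
// A field is empty when the container is not in that state.
type containerState struct {
	waitingStatus    string // e.g. "ImagePullBackOff"
	terminatedStatus string // e.g. "Error"
}

// flaggedByContainerStatus reports whether any container is waiting or
// terminated with a status in the comma-separated list (no whitespace).
func flaggedByContainerStatus(containers []containerState, statuses string) bool {
	wanted := make(map[string]bool)
	for _, s := range strings.Split(statuses, ",") {
		wanted[s] = true
	}
	for _, c := range containers {
		if c.waitingStatus != "" && wanted[c.waitingStatus] {
			return true
		}
		if c.terminatedStatus != "" && wanted[c.terminatedStatus] {
			return true
		}
	}
	return false
}

func main() {
	statuses := "ImagePullBackOff,ErrImagePull,Error"
	stuck := []containerState{{waitingStatus: "ImagePullBackOff"}}
	healthy := []containerState{{}, {}}
	fmt.Println(flaggedByContainerStatus(stuck, statuses))
	fmt.Println(flaggedByContainerStatus(healthy, statuses))
}
```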

POD_STATUS

Flags a pod for reaping based on the pod status.

Enabled and configured by setting the environment variable POD_STATUSES to a comma-separated list (no whitespace) of statuses. If the pod status is in the specified list, the pod will be flagged for reaping.

Example:

# every 10 minutes, kill all pods with status Evicted or Unknown
SCHEDULE=@every 10m
POD_STATUSES=Evicted,Unknown

Note that pod status is different from container statuses, as it checks the status of the overall pod rather than the status of containers in the pod. The most obvious use case for this is dealing with Evicted pods.

MAX_DURATION

Flags a pod for reaping based on the pod's current run duration.

Enabled and configured by setting the environment variable MAX_DURATION to a valid Go time.Duration string (example: "1h15m30s"). If a pod has been running longer than the specified duration, the pod will be flagged for reaping.

UNREADY

Flags a pod for reaping based on the time the pod has been unready.

Enabled and configured by setting the environment variable MAX_UNREADY to a valid Go time.Duration string (example: "10m"). If a pod has been unready longer than the specified duration, the pod will be flagged for reaping.

Running Pod-Reapers

Service Accounts

Pod reaper uses the permissions of the pod's service account to list and delete pods. Unless specified otherwise, the service account used will be the default service account in the pod's namespace. In most cases, the default service account will not have the necessary permissions to list and delete pods.

  • Cluster Wide Permissions: example
  • Namespace Specific Permissions: example

Combining Rules

A pod will only be reaped if ALL rules flag the pod for reaping. To reap on OR logic, run a separate pod-reaper for each alternative.

For example, in the same pod-reaper container:

CHAOS_CHANCE=.01
MAX_DURATION=2h

This means that each pod with a run duration of over 2 hours has a 1/100 chance of being reaped on each cycle. If you want 1/100 pods reaped regardless of duration and also want all pods with a run duration of over 2 hours to be reaped, run two pod-reapers: one with CHAOS_CHANCE=.01 and another with MAX_DURATION=2h.

Deployments

Multiple pod-reapers can be easily managed and configured with kubernetes deployments. If you are using deployments, it is encouraged that you leave the RUN_DURATION environment variable unset (or "0s") to let the reaper run forever, since the deployment will reschedule it anyway. Note that the pod-reaper can and will reap itself if it is not excluded.

One Time Runs

You can run pod-reaper as a one-time, limited-duration container by using the RUN_DURATION environment variable. An example use case might be wanting to introduce a high degree of chaos into your kubernetes environment for a short duration:

# 30% chaos chance every 1 minute for 15 minutes
SCHEDULE=@every 1m
RUN_DURATION=15m
CHAOS_CHANCE=.3
