• Stars
    star
    130
  • Rank 267,764 (Top 6 %)
  • Language
    Go
  • License
    Apache License 2.0
  • Created over 6 years ago
  • Updated 30 days ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An application to cycle (bounce) all nodes in a coordinated fashion in an AWS ASG or set of related ASGs

Autorelease

bouncer

Bouncer rebuilds your running infrastructure to make sure it matches the infrastructure you've defined in code. All releases can be downloaded from the release page, or automated in your code via bouncerw (see below).

This tool inspects AWS auto-scaling groups, and terminates, in a controlled fashion, any nodes whose launch templates or launch configurations don't match the one currently configured on the ASG it was launched from. It currently supports two termination methods, serial and canary; read more about them below.

Although the examples for invoking this from code below are written in Terraform, and it's convenient to be executed from within a Terraform environment, there is nothing Terraform-specific about this tool whatsoever.

Serial

Made for bouncing nodes in ASGs of size 1 but which form a logical set. ./bouncer serial --help for all available options. Ex:

./bouncer serial -a hashi-use1-stag-server-0,hashi-use1-stag-server-1,hashi-use1-stag-server-2

Will bounce all the servers in those ASGs one at a time, by invoking the API call autoscaling:TerminateInstanceInAutoScalingGroup which invokes the terminate shutdown hook (if there is one) and removes the instance from the ELB in a way that obeys the cooldown. This API call also decrements the desired capacity by 1 when it calls this (this is to prevent a race condition from the replacement node coming up before the one to be replaced has finished being terminated), then sets it back to what you 1 when it's done.

It will also wait between each bounce for the instance to become healthy again in the ASG (so make sure you've defined either EC2 or ELB health in a robust way for your use-case) before moving on to the next node in the next ASG.

Gotchas for serial mode

This bouncer is really only intended for the ASGs its given to each be of desired_capacity=1 (and therefore min_size=0). You must set your min_size to at least 1 less than desired_capacity. If you keep min_size, max_size and desired_capacity all at the same value, then issuing a TerminateInstanceInAutoScalingGroup will actually defer the termination of the instance in question until after its replacement has started and is running successfully, which is probably not what you want.

If you need to run serial mode against an ASG with an expected desired_capacity other than 1, you'll need to pass-in that number as part of the ASG list. For example, if we wanted to serially bounce nomad workers.

./bouncer serial -a hashi-use1-stag-worker-linux:3,hashi-use1-stag-worker-windows:2

Rolling

Rolling is the same as serial, but does not decrement. This means min_size, max_size, and desired_capacity can all be the same value.

Canary

Made for bouncing ASGs of arbitrary size where additional nodes and scale in before the old nodes have scaled out. ./bouncer canary --help for all available options. Ex:

./bouncer canary -a hashi-use1-stag-worker:3

Only accepts 1 ASG. In this example, the ASG must, at start time of bouncer, have desired_capacity of 3, and max_size of at least 6. This will

  • Bump desired capacity to 4 to create a "canary" node and wait for this node to become healthy.
    • If this script encounters an ASG which already has at least 1 new, healthy node, it skips the canary phase.
  • Increases desired capacity to 6 to give us a whole new set of new machines.
    • If your ASG already has more than 1 healthy new node, the desired capacity is only increased to give you 3 new nodes. That is, if bouncer started where 2 nodes were new and only 1 were old, it would skip the canary phase and instead immediately increase the desired_capacity to 4 to allow a 3rd new, healthy node, before terminating the lone old node.
  • Terminate the old nodes decrementing the desired_capacity with each terminate call.
    • Bouncer calls the ASG terminate API call with should-decrement-desired-capacity.
    • Bouncer doesn't wait for termination of nodes between terminate calls, but after all terminate calls have been issued, it waits for all nodes to finish terminating before exiting with success.

Gotchas for canary mode

  • You should have min_size = desired_capacity and have max_size = 2 * desired_capacity.
  • Your ASGs termination policy is ignored by this bouncer. This bouncer instead terminates all "old" nodes one at-a-time (but without waiting for them to complete in between), instead of letting AWS do it by scaling the ASG down. This is because the AWS scale-down will always scale down based on AZ mismatch before considering any other criteria.
  • If any new nodes exist at bouncer start time, they must be healthy in both the EC2 and ASG, otherwise bouncer will fail immediately.
  • A new node is only canaried if a healthy new node doesn't exist. That is, if this bouncer is re-run on an ASG which already has a healthy new node, this will only add the additional new nodes to meet your final capacity without doing the entire canary workflow again (this is to reduce churn on re-runs of the script).

Slow-canary

Similar to canary, but instead of adding a single canary host, waiting for it to become healthy, then adding the remainder, it only adds one new node at a time, destroying an old node as it goes. Can be useful to avoid quorum issues if halving a cluster all at once is too much churn.

Another way to think about it would be it's like serial mode for ASGs with more than 1 instance, where we don't ever want to drop below desired capacity number of nodes in service.

Ex:

./bouncer slow-canary -a hashi-use1-stag-server:3

Only accepts 1 ASG. In this example, the ASG must, at start time of bouncer, have desired_capacity of 3, and max_size of at least 4. This will

  • Bump desired capacity to 4 to create our first "canary" node and wait for this node to become healthy.
    • So that we will keep the number of active nodes at any given time to either 3 or 4, and not let it get to 2 or 5.
  • Terminates one of the old nodes, NOT decrementing the desired_capacity along with it, so that AWS can replace it with a fresh node.
  • Once nodes have settled and we have again 4 healthy nodes, we call terminate on another old node.
  • Once nodes have settled and we have again 4 healthy nodes (3 being new), we call terminate WITH should-decrement-desired-capacity to let us go back to our steady-state of 3 nodes.

New experimental batch modes

Eventually, batch-serial and batch-canary could potentially replace serial and canary with their default values. However, given that the logic in the new batch modes is significantly different than the older modes, they're implemented in parallel for now to prevent potential disruptions relying on the existing behaviour.

Batch-canary

Use-case is, you'd prefer to use canary, but you don't have capacity to double your ASG in the middle phase. Instead, you want to add new nodes in batches as you delete old nodes.

This method takes in a batch parameter. Your final desired capacity + batch determines the maximum bouncer will ever scale your ASG to. Bouncer will never scale your ASG below your desired capacity.

I.e. the core tenants of batch-canary are:

  • Your total InService node count will not go below your given desired capacity
  • Your total node count regardless of status will never go above (desired capacity + batch size)

NOTE: You should probably suspend the "AZ Rebalance" process on your ASG so that AWS doesn't violate these contraints either.

EX: You have an ASG of size 4. You don't have enough instance capacity to run 8 instances, but you do have enough to run 6. Invoke bouncer in batch-canary with a batchsize of 2 to accomplish this. This will

  • Bump desired capacity to 5 to create our first "canary" node.
  • Wait for this node to become healthy (InService).
  • Kill an old node. We will wait for this node to be completely gone before continuing.
  • Given we just killed a node, desired capacity is back to 4. Canary phase is done.
  • Set desired capacity to our max size, 6, which starts two new machines spawning.
  • Wait for these two nodes to become InService.
  • Kill 2 old nodes to get us back down to 4 desired capacity. We've now issued kills to 3 old nodes in total.
  • Again wait for the ASG to settle, which means waiting for the nodes we just killed to fully terminate.
  • Given we only have one old node now, we increase our desired capacity only up to 5, to give us one new node.
  • Once this node enters InService, we issue the terminate to the final old node.
  • Wait for the old node to totally finish terminating. Done!

Batch-serial

Use-case is, you'd prefer to use serial, but you have way too many instances so this takes too long. You can't use canary because your desired capacity is also your max capacity for external reasons (perhaps you tie a static EBS volume to every instance in this ASG).

This method takes in a batch parameter. batch determines the maximum number of instances bouncer will delete at any time from your desired capacity. Bouncer will never scale your ASG above your desired capacity.

I.e. the core tenants of batch-serial are:

  • Your total InService node count will not go below (desired capacity - batch size)
  • Your total node count regardless of status will never go above your given desired capacity

NOTE: You should probably suspend the "AZ Rebalance" process on your ASG so that AWS doesn't violate these contraints either.

EX: You have an ASG of size 4. You don't want to delete one instance at a time, but two at a time is ok. Set batch to 2. This mode still does canary a single node so you don't potentially batch a huge number of instances that might all fail to boot. This will

  • Terminate a single node, waiting for it to be fully destroyed.
  • Increase desired capacity back to original value.
  • Wait for this new node to become healthy. Canary phase is done.
  • Kill up to batchsize nodes, so in this case, 2. Wait for them to fully die.
  • Increase desired capacity back to original value, and wait for all nodes to come up healthy.
  • Kill last old node, wait for it to fully die.
  • Increase capacity back to original value, and wait for all nodes to become healthy.

Force bouncing all nodes

By default, the bouncer will ignore any nodes which are running the same launch template version (or same launch configuration) that's set on their ASG. If you've made a change external to the launch configuration / template and want the bouncer to start over bouncing all nodes regardless of launch config / template "oldness", you can add the -f flag to any of the run types. This flag marks any node whose launch time is older than the start time of the current bouncer invocation as "out of date", thus bouncing all nodes.

Running the bouncer in Terraform

  • Grab bouncerw at the top-level of this repo and place it in the top-level of your Terraform.
    • By top-level, this means the top-level of the whole repo, not of each module.
    • Terraform modules should not need to include this wrapper, and instead each caller of modules should have one copy of the wrapper at its top-level, and all modules or top-level code should use it automatically when invoked with ./bouncerw.
  • For each logical set of ASGs you'd like cycled in this way, add a null_resource block to call this wrapper script and pass-in the ASG(s) in question.
  • For example, if you're using launch tamplates, to cycle an ASG in canary mode, create a null_resource which triggers on a change to the launch template:
resource "null_resource" "server_canary_bouncer" {
  triggers {
    trigger = "${aws_launch_template.server.latest_version}"
  }

  provisioner "local-exec" {
    command = "./bouncerw canary -a '${aws_autoscaling_group.server.name}:${var.worker_count}'"
  }
}
  • If you need to use serial mode across multiple ASGs, something like this can work:
resource "null_resource" "server_serial_bouncer" {
  triggers {
    trigger = "${join(",", aws_launch_template.server.*.latest_version)}"
  }

  provisioner "local-exec" {
    command = "./bouncerw serial -a '${join(",", aws_autoscaling_group.server.*.name)}'"
  }
}
  • If you're using launch configs instead, it's similar, but the trigger needs to be the LC:
resource "null_resource" "server_bouncer" {
  triggers {
    trigger = "${aws_autoscaling_group.server.launch_configuration}"
  }

  provisioner "local-exec" {
    command = "./bouncerw canary -a '${aws_autoscaling_group.server.name}:${var.server_count}'"
  }
}
  • And of course the similar example in serial mode w/ multiple ASGs:
resource "null_resource" "server_canary_bouncer" {
  triggers {
    trigger = "${join(",", aws_autoscaling_group.server.*.launch_configuration)}"
  }

  provisioner "local-exec" {
    command = "./bouncerw serial -a '${join(",", aws_autoscaling_group.server.*.name)}'"
  }
}

Killswitch

As bouncer is usually going to run inside Terraform, there may be times when you want to apply all changes to your Terraform environment without actually invoking the bouncer. In order to do this, set the BOUNCER_KILLSWITCH environment variable to a non-empty value.

Running a command before instance is terminated

Sometimes, there may be an action that needs to be performed before an instance is removed from its ELB. For example, Vault listens in active/passive mode, so removing the master server from the main Vault ELB before the master has stepped-down its responsibilities to another node, means Vault is down until the lifecycle hooks kick-in and the master node steps-down.

You can use the flag -p to callout to an external command before every terminate call. The number of callouts must match the number of ASGs given to bouncer (since you will probably need to pass-in the ASG name or something else to your external command). Any waits or other checks you need before the instance is terminated should also be baked into this external command; bouncer will call terminate on the instance as soon as the external command returns success. See below chaining example for an example of using this with Vault.

These should be used sparingly, as most logic should be baked into your AMIs terminate hook; these should only contain logic that must run before ELB removal / draining.

Chaining bouncers together

If there are multiple ASGs in your repo which need to be bounced in a particular order, chain their associated null_resources together. Here I'm bouncing the Consul servers, then the Vault servers, then Nomad servers, and finally the Nomad workers, in order.

resource "null_resource" "consul_server_bouncer" {
  triggers {
    trigger = "..."
  }

  provisioner "local-exec" {
    command = "..."
  }
}

resource "null_resource" "vault_server_bouncer" {
  triggers {
    trigger = "..."
  }

  provisioner "local-exec" {
    command = "..."
  }

  depends_on = [
    "null_resource.consul_server_bouncer",
  ]
}

resource "null_resource" "nomad_server_bouncer" {
  triggers {
    trigger = "..."
  }

  provisioner "local-exec" {
    command = "..."
  }

  depends_on = [
    "null_resource.consul_server_bouncer",
    "null_resource.vault_server_bouncer",
  ]
}

resource "null_resource" "nomad_worker_bouncer" {
  triggers {
    trigger = "..."
  }

  provisioner "local-exec" {
    command = "..."
  }

  depends_on = [
    "null_resource.consul_server_bouncer",
    "null_resource.nomad_server_bouncer",
    "null_resource.vault_server_bouncer",
  ]
}

Required Permissions

In order to run the bouncer with launch templates, the following permissions are required:

autoscaling:DescribeAutoScalingGroups
autoscaling:CompleteLifecycleAction
autoscaling:TerminateInstanceInAutoScalingGroup
autoscaling:SetDesiredCapacity
ec2:DescribeInstances
ec2:DescribeInstanceAttribute
ec2:DescribeLaunchTemplates

For using bouncer with launch configurations, the required permissions are:

autoscaling:DescribeAutoScalingGroups
autoscaling:DescribeLaunchConfigurations
autoscaling:CompleteLifecycleAction
autoscaling:TerminateInstanceInAutoScalingGroup
autoscaling:SetDesiredCapacity
ec2:DescribeInstances
ec2:DescribeInstanceAttribute

Note that several of these permissions could cause service outages if abused. If this is a concern, scoping the permissions is recommended.

Contributing

For general guidelines on contributing the Palantir products, see this page

More Repositories

1

blueprint

A React-based UI toolkit for the web
TypeScript
19,885
star
2

tslint

🚦 An extensible linter for the TypeScript language
TypeScript
5,916
star
3

plottable

πŸ“Š A library of modular chart components built on D3
TypeScript
2,926
star
4

python-language-server

An implementation of the Language Server Protocol for Python
Python
2,579
star
5

windows-event-forwarding

A repository for using windows event forwarding for incident detection and response
Roff
1,186
star
6

pyspark-style-guide

This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
Python
945
star
7

osquery-configuration

A repository for using osquery for incident detection and response
800
star
8

tslint-react

πŸ“™ Lint rules related to React & JSX for TSLint.
TypeScript
752
star
9

bulldozer

GitHub Pull Request Auto-Merge Bot
Go
725
star
10

gradle-docker

a Gradle plugin for orchestrating docker builds and pushes.
Groovy
723
star
11

policy-bot

A GitHub App that enforces approval policies on pull requests
Go
700
star
12

alerting-detection-strategy-framework

A framework for developing alerting and detection strategies for incident response.
610
star
13

stacktrace

Stack traces for Go errors
Go
498
star
14

docker-compose-rule

A JUnit rule to manage docker containers using docker-compose
Java
422
star
15

conjure

Strongly typed HTTP/JSON APIs for browsers and microservices
Java
393
star
16

palantir-java-format

A modern, lambda-friendly, 120 character Java formatter.
Java
373
star
17

eclipse-typescript

An Eclipse plug-in for developing in the TypeScript language.
JavaScript
341
star
18

gradle-git-version

a Gradle plugin that uses `git describe` to produce a version string.
Java
339
star
19

go-githubapp

A simple Go framework for building GitHub Apps
Go
317
star
20

godel

Go tool for formatting, checking, building, distributing and publishing projects
Go
304
star
21

gradle-baseline

A set of Gradle plugins that configure default code quality tools for developers.
Java
283
star
22

jamf-pro-scripts

A collection of scripts and extension attributes created for managing Mac workstations via Jamf Pro.
Shell
277
star
23

gradle-graal

A plugin for Gradle that adds tasks to download, extract and interact with GraalVM tooling.
Java
225
star
24

log4j-sniffer

A tool that scans archives to check for vulnerable log4j versions
Go
193
star
25

tfjson

Terraform plan file to JSON
Go
182
star
26

k8s-spark-scheduler

A Kubernetes Scheduler Extender to provide gang scheduling support for Spark on Kubernetes
Go
175
star
27

Sysmon

A lightweight platform monitoring tool for Java VMs
Java
155
star
28

typesettable

πŸ“ A typesetting library for SVG and Canvas
TypeScript
146
star
29

documentalist

πŸ“ A sort-of-static site generator optimized for living documentation of software projects
TypeScript
145
star
30

exploitguard

Documentation and supporting script sample for Windows Exploit Guard
PowerShell
144
star
31

gradle-consistent-versions

Compact, constraint-friendly lockfiles for your dependencies
Java
111
star
32

Cinch

A Java library that manages component action/event bindings for MVC patterns
Java
110
star
33

redoodle

An addon library for Redux that enhances its integration with TypeScript.
TypeScript
99
star
34

gradle-jacoco-coverage

Groovy
99
star
35

sqlite3worker

A threadsafe sqlite worker for Python
Python
94
star
36

phishcatch

A browser extension and API server for detecting corporate password use on external websites
CSS
83
star
37

python-jsonrpc-server

A Python 2 and 3 asynchronous JSON RPC server
Python
80
star
38

conjure-java-runtime

Opinionated libraries for HTTP&JSON-based RPC using Retrofit, Feign, OkHttp as clients and Jetty/Jersey as servers
Java
78
star
39

stashbot

A plugin for Atlassian Stash to allow easy, self-service continuous integration with Jenkins
Java
67
star
40

go-baseapp

A lightweight starting point for Go web servers
Go
67
star
41

stash-codesearch-plugin

Provides global repository, commit, and file content search for Atlassian Stash instances
Java
62
star
42

gradle-processors

Gradle plugin for integrating Java annotation processors
Groovy
62
star
43

go-java-launcher

A simple Go program for launching Java programs from a fixed configuration. This program replaces Gradle-generated Bash launch scripts which are susceptible to attacks via injection of environment variables of the form JAVA_OPTS='$(rm -rf /)'.
Go
59
star
44

pkg

A collection of stand-alone Go packages
Go
53
star
45

rust-zipkin

A library for logging and propagating Zipkin trace information in Rust
Rust
53
star
46

grunt-tslint

A Grunt plugin for tslint.
JavaScript
51
star
47

witchcraft-go-server

A highly opinionated Go embedded application server for RESTy APIs
Go
50
star
48

spark-influx-sink

A Spark metrics sink that pushes to InfluxDb
Scala
50
star
49

giraffe

Gracefully Integrated Remote Access For Files and Execution
Java
49
star
50

language-servers

[Deprecated and No longer supported] A collection of implementations for the Microsoft Language Server Protocol
Java
46
star
51

go-license

Go tool that applies and verifies that proper license headers are applied to Go files
Go
44
star
52

hadoop-crypto

Library for per-file client-side encyption in Hadoop FileSystems such as HDFS or S3.
Java
41
star
53

tritium

Tritium is a library for instrumenting applications to provide better observability at runtime
Java
39
star
54

sls-packaging

A set of Gradle plugins for creating SLS-compatible packages
Shell
38
star
55

roboslack

A pluggable, fluent, straightforward Java library for interacting with Slack.
Java
37
star
56

dropwizard-web-security

A Dropwizard bundle for applying default web security functionality
Java
37
star
57

goastwriter

Go library for writing Go source code programatically
Go
31
star
58

gradle-gitsemver

Java
31
star
59

palantir-python-sdk

Palantir Python SDK
Python
30
star
60

gradle-revapi

Gradle plugin that uses Revapi to check whether you have introduced API/ABI breaks in your Java public API
Java
29
star
61

checks

Go libraries and programs for performing static checks on Go projects
Go
29
star
62

gradle-circle-style

πŸš€πŸš€πŸš€MOVED TO Baseline
Java
28
star
63

trove

Patched version of the Trove 3 library - changes the Collections semantics to match proper java.util.Map semantics
Java
27
star
64

dialogue

A client-side RPC library for conjure-java
Java
27
star
65

atlasdb

Transactional Distributed Database Layer
Java
27
star
66

stylelint-config-palantir

Palantir's stylelint config
JavaScript
25
star
67

typedjsonrpc

A typed decorator-based JSON-RPC library for Python
Python
24
star
68

encrypted-config-value

Tooling for encrypting certain configuration parameter values in dropwizard apps
Java
22
star
69

typescript-service-generator

Java
21
star
70

streams

Utilities for working with Java 8 streams
Java
21
star
71

distgo

Go tool for building, distributing and publishing Go projects
Go
21
star
72

gradle-npm-run-plugin

Groovy
20
star
73

conjure-python

Conjure generator for Python clients
Java
19
star
74

conjure-java

Conjure generator for Java clients and servers
Java
19
star
75

amalgomate

Go tool for combining multiple different main packages into a single program or library
Go
19
star
76

serde-encrypted-value

A crate which wraps Serde deserializers and decrypts values
Rust
19
star
77

gradle-docker-test-runner

Gradle plugin for running tests in Docker environments
Groovy
19
star
78

gerrit-ci

Plugin for Gerrit enabling self-service continuous integration workflows with Jenkins.
Java
18
star
79

conjure-rust

Conjure support for Rust
Rust
18
star
80

conjure-typescript

Conjure generator for TypeScript clients
TypeScript
17
star
81

gpg-tap-notifier-macos

Show a macOS notification when GPG is waiting for you to tap/touch a security device (e.g. YubiKey).
Swift
17
star
82

tracing-java

Java library providing zipkin-like tracing functionality
Java
16
star
83

plottable-moment

Plottable date/time formatting library built on Moment.js
JavaScript
16
star
84

spark-tpcds-benchmark

Utility for benchmarking changes in Spark using TPC-DS workloads
Java
16
star
85

assertj-automation

Automatic code rewriting for AssertJ using error-prone and refaster
Java
15
star
86

metric-schema

Schema for standard metric definitions
Java
14
star
87

safe-logging

Interfaces and utilities for safe log messages
Java
14
star
88

resource-identifier

Common resource identifier specification for inter-application object sharing
Java
14
star
89

dropwizard-web-logger

WebLoggerBundle is a Dropwizard bundle used to help log web activity to log files on a server’s backend
Java
14
star
90

gradle-shadow-jar

Gradle plugin to precisely shadow either a dependency or its transitives
Groovy
14
star
91

gradle-miniconda-plugin

Plugin that sets up a Python environment for building and running tests using Miniconda.
Java
13
star
92

conjure-go-runtime

Go implementation of the Conjure runtime
Go
12
star
93

gulp-count

Counts files in vinyl streams.
CoffeeScript
12
star
94

gradle-configuration-resolver-plugin

Groovy
12
star
95

asana_mailer

A script that uses Asana's RESTful API to generate plaintext and HTML emails.
Python
12
star
96

human-readable-types

A collection of human-readable types
Java
11
star
97

dropwizard-index-page

A Dropwizard bundle that serves the index page for a single page application
Java
11
star
98

eclipse-less

An Eclipse plug-in for compiling LESS files.
Java
11
star
99

go-compiles

Go check that checks that Go source and tests compiles
Go
11
star
100

go-generate

Go tool that runs and verifies the output of go generate
Go
11
star