EKS Rolling Update

Intro

EKS Rolling Update is a utility for updating the launch configuration or template of worker nodes in an EKS cluster. It updates worker nodes in a rolling fashion and performs health checks of your EKS cluster to ensure no disruption to service. To achieve this, it performs the following actions:

  • Pauses Kubernetes Autoscaler (Optional)
  • Finds a list of worker nodes that do not have a launch config or template that matches their ASG
  • Scales up the desired capacity
  • Ensures the ASGs are healthy and that the new nodes have joined the EKS cluster
  • Cordons the outdated worker nodes
  • Suspends AWS Autoscaling actions while update is in progress
  • Drains the outdated EKS worker nodes one by one
  • Terminates EC2 instances of the worker nodes one by one
  • Detaches EC2 instances from the ASG one by one
  • Scales down the ASG to original count (in case of failure)
  • Resumes AWS Autoscaling actions
  • Resumes Kubernetes Autoscaler (Optional)
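
The second step above (finding nodes whose launch config or template no longer matches their ASG) can be sketched as follows. This is a minimal illustration, not the tool's own code: the dictionaries are assumed to have the shape of one entry in the `AutoScalingGroups` list returned by boto3's `describe_auto_scaling_groups`, the function names are hypothetical, and the sketch does not resolve symbolic template versions such as `$Latest` or handle mixed instances policies.

```python
def instance_is_outdated(asg, instance):
    """Return True if the instance's launch config/template differs from its ASG's."""
    asg_launch_config = asg.get("LaunchConfigurationName")
    if asg_launch_config is not None:
        return instance.get("LaunchConfigurationName") != asg_launch_config

    asg_template = asg.get("LaunchTemplate")
    if asg_template is not None:
        instance_template = instance.get("LaunchTemplate") or {}
        return (
            instance_template.get("LaunchTemplateId") != asg_template.get("LaunchTemplateId")
            or str(instance_template.get("Version")) != str(asg_template.get("Version"))
        )

    # Neither field is set on the ASG: this sketch does not handle that case.
    return False


def outdated_instance_ids(asg):
    """List the IDs of all instances in the ASG that look outdated."""
    return [
        instance["InstanceId"]
        for instance in asg.get("Instances", [])
        if instance_is_outdated(asg, instance)
    ]


# Hand-written sample data, assumed to mirror the boto3 response shape.
sample_asg = {
    "AutoScalingGroupName": "workers",
    "LaunchConfigurationName": "lc-v2",
    "Instances": [
        {"InstanceId": "i-aaa", "LaunchConfigurationName": "lc-v1"},
        {"InstanceId": "i-bbb", "LaunchConfigurationName": "lc-v2"},
    ],
}
print(outdated_instance_ids(sample_asg))
```

Running the tool with --plan reports the outdated instances without making any changes.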

Requirements

  • kubectl installed
  • KUBECONFIG environment variable set, or config available in ${HOME}/.kube/config by default
  • AWS credentials configured

IAM Requirements

The following IAM permissions are required:

autoscaling:DescribeAutoScalingGroups
autoscaling:TerminateInstanceInAutoScalingGroup
autoscaling:SuspendProcesses
autoscaling:ResumeProcesses
autoscaling:UpdateAutoScalingGroup
autoscaling:CreateOrUpdateTags
autoscaling:DeleteTags
ec2:DescribeLaunchTemplates
ec2:DescribeInstances
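
As a convenience, the permissions above can be assembled into an IAM policy document. The following is a sketch only: the `build_policy` helper is hypothetical, and `"Resource": "*"` is an assumption that you may want to narrow to specific ASG ARNs. The EC2 describe action is written as `ec2:DescribeInstances`, which is the name IAM uses for it.

```python
import json

# Actions taken from the list of required IAM permissions.
REQUIRED_ACTIONS = [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    "autoscaling:SuspendProcesses",
    "autoscaling:ResumeProcesses",
    "autoscaling:UpdateAutoScalingGroup",
    "autoscaling:CreateOrUpdateTags",
    "autoscaling:DeleteTags",
    "ec2:DescribeLaunchTemplates",
    "ec2:DescribeInstances",
]


def build_policy(actions, resource="*"):
    """Build a single-statement IAM policy; the "*" resource is an assumption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```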

Installation

From PyPI

pip3 install eks-rolling-update

From source

virtualenv -p python3 venv
source venv/bin/activate
pip3 install -r requirements.txt

Usage

usage: eks_rolling_update.py [-h] --cluster_name CLUSTER_NAME [--plan]

Rolling update on cluster

optional arguments:
  -h, --help            show this help message and exit
  --cluster_name CLUSTER_NAME, -c CLUSTER_NAME
                        the cluster name to perform rolling update on
  --plan, -p            perform a dry run to see which instances are out of
                        date

Example:

eks_rolling_update.py -c my-eks-cluster

Configuration

Core Configuration

Environment Variable | Description | Default
RUN_MODE | Overall strategy for handling multiple ASGs and identifying nodes to roll. See the Run Modes section below | 1
DRY_RUN | If True, only a query will be run to determine which worker nodes are outdated, without running an update operation | False
CLUSTER_HEALTH_WAIT | Number of seconds to wait after the ASG has been scaled up before checking the health of the nodes in the cluster | 90
CLUSTER_HEALTH_RETRY | Number of attempts to validate the health of the cluster after the ASG has been scaled | 1
GLOBAL_MAX_RETRY | Number of attempts for a node health or instance termination check | 12
GLOBAL_HEALTH_WAIT | Number of seconds to wait before retrying a node health or instance termination check | 20
BETWEEN_NODES_WAIT | Number of seconds to wait after removing a node before continuing | 0
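
To illustrate how these settings fit together, here is a minimal sketch that reads the core variables with the defaults from the table and uses GLOBAL_MAX_RETRY and GLOBAL_HEALTH_WAIT in a retry loop. The helper names (`str_to_bool`, `load_core_settings`, `wait_until`) and the set of strings accepted as true are assumptions for illustration; the tool's own parsing and retry logic may differ.

```python
import os
import time


def str_to_bool(value):
    """Interpret common truthy strings; the accepted set is an assumption."""
    return str(value).strip().lower() in ("1", "true", "yes", "y")


def load_core_settings(env=None):
    """Read the core settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "RUN_MODE": int(env.get("RUN_MODE", 1)),
        "DRY_RUN": str_to_bool(env.get("DRY_RUN", False)),
        "CLUSTER_HEALTH_WAIT": int(env.get("CLUSTER_HEALTH_WAIT", 90)),
        "CLUSTER_HEALTH_RETRY": int(env.get("CLUSTER_HEALTH_RETRY", 1)),
        "GLOBAL_MAX_RETRY": int(env.get("GLOBAL_MAX_RETRY", 12)),
        "GLOBAL_HEALTH_WAIT": int(env.get("GLOBAL_HEALTH_WAIT", 20)),
        "BETWEEN_NODES_WAIT": int(env.get("BETWEEN_NODES_WAIT", 0)),
    }


def wait_until(check, max_retry, wait_seconds, sleep=time.sleep):
    """Call check() up to max_retry times, sleeping between failed attempts."""
    for attempt in range(1, max_retry + 1):
        if check():
            return True
        if attempt < max_retry:
            sleep(wait_seconds)
    return False


settings = load_core_settings({"DRY_RUN": "true", "GLOBAL_MAX_RETRY": "3"})
print(settings["DRY_RUN"], settings["GLOBAL_MAX_RETRY"], settings["GLOBAL_HEALTH_WAIT"])
```

In this sketch, a health check would be wrapped as `wait_until(check, settings["GLOBAL_MAX_RETRY"], settings["GLOBAL_HEALTH_WAIT"])`, where `check` is any callable that returns True once the node or instance is in the expected state.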

ASG & Node-Related Controls

Environment Variable | Description | Default
ASG_DESIRED_STATE_TAG | Temporary tag which will be saved to the ASG to store the state of the EKS cluster prior to update | eks-rolling-update:desired_capacity
ASG_ORIG_CAPACITY_TAG | Temporary tag which will be saved to the ASG to store the state of the EKS cluster prior to update | eks-rolling-update:original_capacity
ASG_ORIG_MAX_CAPACITY_TAG | Temporary tag which will be saved to the ASG to store the state of the EKS cluster prior to update | eks-rolling-update:original_max_capacity
ASG_NAMES | List of space-delimited ASG names. Of the ASGs attached to the cluster, only these will be processed for the rolling update. If this is left empty, all ASGs of the cluster will be processed | ""
BATCH_SIZE | Number of instances to scale the ASG by at a time. When set to 0, batching is disabled. See the Batching section | 0
MAX_ALLOWABLE_NODE_AGE | The maximum age, in days, that a node is allowed to reach. Used with RUN_MODE 4, which rolls nodes based on their age | 6
EXCLUDE_NODE_LABEL_KEYS | List of space-delimited keys for node labels. Nodes with a label using one of these keys will be excluded from the node count when scaling the cluster | spotinst.io/node-lifecycle
ASG_USE_TERMINATION_POLICY | Prefer the ASG termination policy (instance terminate/detach is handled by the ASG according to its configured termination policy) | False
INSTANCE_WAIT_FOR_STOPPING | Only wait for terminated instances to reach the stopping or shutting-down state, instead of fully terminated or stopped | False

K8S Node & Pod Controls

Environment Variable | Description | Default
K8S_AUTOSCALER_ENABLED | If True, Kubernetes Autoscaler will be paused before running the update | False
K8S_AUTOSCALER_NAMESPACE | Namespace where Kubernetes Autoscaler is deployed | default
K8S_AUTOSCALER_DEPLOYMENT | Deployment name of Kubernetes Autoscaler | cluster-autoscaler
K8S_AUTOSCALER_REPLICAS | Number of replicas to scale back up to after Kubernetes Autoscaler has been paused | 2
K8S_CONTEXT | Context from the Kubernetes config to use. If this is left undefined, the current-context is used | None
K8S_PROXY_BYPASS | Set to true to ignore HTTPS_PROXY and HTTP_PROXY and disable use of any configured proxy when talking to the K8S API | False
TAINT_NODES | Replace the default cordon-before-drain strategy with NoSchedule tainting, as a workaround for K8S < 1.19 prematurely removing cordoned nodes from Service-managed LoadBalancers | False
EXTRA_DRAIN_ARGS | Additional space-delimited args to supply to the kubectl drain function, e.g. --force=true. See kubectl drain -h | ""
ENFORCED_DRAINING | If draining fails for a node due to corrupted PodDisruptionBudgets or failing pods, retry draining with --disable-eviction=true and --force=true for this node to prevent aborting the script. This is useful for completing the rolling update in development and testing environments, and should not be used in production environments because it bypasses PodDisruptionBudget checks | False
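
The way EXTRA_DRAIN_ARGS and ENFORCED_DRAINING could shape the drain command is sketched below. This is an assumption-laden illustration: `build_drain_command` is a hypothetical helper, the baseline `--ignore-daemonsets` flag is a guess at a sensible default rather than a statement of what the tool passes, and the node name is made up.

```python
import shlex


def build_drain_command(node_name, extra_drain_args="", enforced=False):
    """Assemble a kubectl drain argument list for one node."""
    # Baseline flag chosen for this sketch; the tool's own defaults may differ.
    command = ["kubectl", "drain", node_name, "--ignore-daemonsets"]
    # EXTRA_DRAIN_ARGS is space-delimited, so split it like a shell would.
    command += shlex.split(extra_drain_args)
    if enforced:
        # Flags used for the enforced retry; these bypass PodDisruptionBudgets.
        command += ["--disable-eviction=true", "--force=true"]
    return command


print(build_drain_command("ip-10-0-0-1.ec2.internal", "--timeout=300s"))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with node names or argument values.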

Run Modes

There are a number of different values which can be set for the RUN_MODE environment variable.

1 is the default.

Mode Number | Description
1 | Scale up and cordon/taint the outdated nodes of each ASG one-by-one, just before we drain them.
2 | Scale up and cordon/taint the outdated nodes of all ASGs all at once at the beginning of the run.
3 | Cordon/taint the outdated nodes of all ASGs at the beginning of the run but scale each ASG one-by-one.
4 | Roll EKS nodes based on age instead of launch config (used together with MAX_ALLOWABLE_NODE_AGE, which defaults to 6 days).
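
For mode 4, the age check can be pictured roughly as below. This is a sketch under assumptions: `node_is_too_old` is a hypothetical helper, the node's creation time is assumed to be available as a timezone-aware datetime, and whether the tool treats a node of exactly the maximum age as outdated is not stated here, so the strict comparison is a guess.

```python
from datetime import datetime, timedelta, timezone


def node_is_too_old(creation_timestamp, max_age_days=6, now=None):
    """Return True if the node is older than max_age_days (strictly)."""
    now = now or datetime.now(timezone.utc)
    return now - creation_timestamp > timedelta(days=max_age_days)


reference = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(node_is_too_old(datetime(2024, 1, 1, tzinfo=timezone.utc), now=reference))
print(node_is_too_old(datetime(2024, 1, 5, tzinfo=timezone.utc), now=reference))
```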

Each mode has different advantages and disadvantages:

  • Scaling up all ASGs at once may cause AWS EC2 instance limits to be exceeded
  • Only cordoning the nodes on a per-ASG basis will mean that pods are likely to be moved more than once
  • Cordoning the nodes for all ASGs at once could cause issues if new pods need to start during the process
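
The difference between modes 1 to 3 is mostly one of ordering, which the following sketch tries to make concrete. It is an interpretation of the mode descriptions, not the tool's code: `plan_steps` and the step names are hypothetical, and mode 4 is left out because it changes how nodes are selected rather than how the steps are ordered.

```python
def plan_steps(run_mode, asg_names):
    """Return the ordered (action, asg) steps implied by run modes 1 to 3."""
    steps = []
    if run_mode == 1:
        # Each ASG is scaled up, cordoned and drained before moving on.
        for asg in asg_names:
            steps += [("scale_up", asg), ("cordon", asg), ("drain", asg)]
    elif run_mode == 2:
        # Everything is scaled up and cordoned first, then drained.
        steps += [("scale_up", asg) for asg in asg_names]
        steps += [("cordon", asg) for asg in asg_names]
        steps += [("drain", asg) for asg in asg_names]
    elif run_mode == 3:
        # All ASGs are cordoned first, but scaled up one at a time.
        steps += [("cordon", asg) for asg in asg_names]
        for asg in asg_names:
            steps += [("scale_up", asg), ("drain", asg)]
    else:
        raise ValueError(f"run mode {run_mode} is not covered by this sketch")
    return steps


for step in plan_steps(3, ["asg-a", "asg-b"]):
    print(step)
```

Comparing the output for modes 1 and 2 shows the trade-off described above: mode 2 requests all extra capacity up front, while mode 1 leaves later ASGs schedulable, so pods evicted early may land on nodes that are drained later.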

Batching

EKS Rolling Update can batch scale-out the ASG to progressively reach the desired instance count before it begins draining the nodes.

This is intended for use in cases where a large ASG scale-out may result in instances failing to register with EKS. Such a scenario is more likely to occur with larger ASGs where (for example) a 100 instance ASG may be asked to scale to 200 (temporarily). Users may find that some instances never register, and this causes EKS Rolling Update to hang indefinitely waiting for the registered EKS node count to match the instance count.

If this happens, you may want to consider batching.

For example, if the ASG will be scaled from 100 instances to 200 instances, specifying a batch size of 10 will result in the ASG first scaling to 110 instances, then 120, 130, and so on until 200 is reached. Once the desired count is reached, the tool will proceed with the normal draining/scale-in operations.
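
The intermediate capacities in that example can be computed with a few lines of Python. This is only a sketch of the arithmetic: `batch_targets` is a hypothetical helper, and capping the last step at the desired count when the batch size does not divide evenly is an assumption.

```python
def batch_targets(current, desired, batch_size):
    """Return the successive desired-capacity values for a batched scale-out."""
    if batch_size <= 0 or desired <= current:
        # Batching disabled (BATCH_SIZE=0) or nothing to scale: one step.
        return [desired]
    targets = list(range(current + batch_size, desired, batch_size))
    targets.append(desired)  # The final step always lands on the desired count.
    return targets


print(batch_targets(100, 200, 10))
print(batch_targets(100, 125, 10))
```

The first call prints the sequence from the example above (110, 120, and so on up to 200), and the second shows an uneven case ending in a smaller final step.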

Examples

  • Plan
$ python eks_rolling_update.py --cluster_name YOUR_EKS_CLUSTER_NAME --plan
  • Apply Changes
$ python eks_rolling_update.py --cluster_name YOUR_EKS_CLUSTER_NAME
  • Cluster Autoscaler

If using cluster-autoscaler, you must let eks-rolling-update know that cluster-autoscaler is running in your cluster by exporting the following environment variables:

$ export  K8S_AUTOSCALER_ENABLED=true \
          K8S_AUTOSCALER_NAMESPACE="${CA_NAMESPACE}" \
          K8S_AUTOSCALER_DEPLOYMENT="${CA_DEPLOYMENT_NAME}"
  • Disable operations on cluster-autoscaler
$ unset K8S_AUTOSCALER_ENABLED
  • Configure tool via .env file

Rather than using environment variables, you can use a .env file within your working directory to load updater settings. For example:

$ cat .env
DRY_RUN=1

Docker

Although no public Docker image is currently published for this project, feel free to use the included Dockerfile to build your own image.

make docker-dist version=1.0.DEV

After building the image, run it with the following command:

docker run -ti --rm \
  -e AWS_DEFAULT_REGION \
  -v "${HOME}/.aws:/root/.aws" \
  -v "${HOME}/.kube/config:/root/.kube/config" \
  eks-rolling-update:latest \
  -c my-cluster

Pass in any additional environment variables and options as described elsewhere in this file.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
