EC2 Auto Scaling with Ansible
We use Ansible to manage application deployments to EC2 with Auto Scaling. It's particularly suited because it lends itself to easy integration with existing processes such as CI, enabling rapid development of a continuous deployment pipeline. One crucial feature is that it is able to hand-hold a rolling deploy (that is, zero downtime) by terminating and replacing instances in batches. Typically when we deploy to EC2, we do so in an automated fashion which makes it important to have rollback capability and for this, we typically maintain a short history of Amazon Machine Images (AMIs) and Launch Configurations which are associated with a particular Auto Scaling Group (ASG). In the event you wish to roll back to a particular version of your application, you can simply associate your ASG with the previously known working launch configuration and replace all your instances.
Our normal workflow for auto scaling deployments starts with an Ansible playbook which runs through the deploy lifecycle. Each step along the way is represented by a role and applied in order, keeping the main playbook lean and configurable. Depending on our client's requirements, that playbook might be triggered in a number of ways such as the final step in a continuous integration build, or on demand via Hubot in a Slack/Flowdock/IRC chat.
In this post we'll walk through each stage of the build and deployment process, and use Ansible to perform all the work. The goal is to build our entire environment from scratch, save for a few manually created resources at the outset.
Preparing AWS
We'll be using EC2 Classic for these examples, although they can be trivially adapted for VPC. Start by creating an EC2 Security Group for your application, taking care to open the necessary ports for your application in addition to TCP/22 for SSH.
Add a new keypair for SSH access to your instances. You can either create a new private/public keypair or upload your existing SSH public key.
You may optionally register and host a domain name with AWS Route 53. If you do so, the domain will be pointed at your application so that you don't have to browse to it by using an automatically assigned AWS hostname.
Setting up Ansible
Ansible uses Boto for AWS interactions, so you'll need that installed on your control host. We're also going to make some use of the AWS CLI tools, so get those too. Your platform may differ, but the following will work for most platforms:
pip install python-boto awscli
We also assume Ansible 1.9.x, for Ubuntu you can get that from the Ansible PPA.
add-apt-repository ppa:ansible/ansible
apt-get install ansible
You should place your AWS access/secret keys into ~/.aws/credentials
[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
We'll be using the ec2.py dynamic inventory script for Ansible so we can address our EC2 instances by various attributes instead of hard coding hostnames into an inventory file. It's not included with the Ubuntu distribution(s) of Ansible, so we'll grab it from GitHub. Place ec2.py and ec2.ini into /etc/ansible/inventory
(creating that directory if absent)
Modify /etc/ansible/ansible.cfg
to use that directory as the inventory source:
# /etc/ansible/ansible.cfg
inventory = /etc/ansible/inventory
Step 1: Launch a new EC2 instance
A prerequisite to setting up an application for auto scaling involves building an AMI containing your working application, which will be used to launch new instances to meet demand. We'll start by launching a new instance onto which we can deploy our application. Create the following files:
---
# group_vars/all.yml
region: us-east-1
zone: us-east-1a
keypair: YOUR_KEYPAIR
security_groups: YOUR_SECURITY_GROUP
instance_type: m3.medium
volumes:
- device_name: /dev/sda1
device_type: gp2
volume_size: 20
delete_on_termination: true
---
# deploy.yml
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
---
# roles/launch/tasks/main.yml
- name: Search for the latest Ubuntu 14.04 AMI
ec2_ami_find:
region: "{{ region }}"
name: "ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"
owner: 099720109477
sort: name
sort_order: descending
sort_end: 1
no_result_action: fail
register: ami_result
- name: Launch new instance
ec2:
region: "{{ region }}"
keypair: "{{ keypair }}"
zone: "{{ zone }}"
group: "{{ security_groups }}"
image: "{{ ami_result.results[0].ami_id }}"
instance_type: "{{ instance_type }}"
instance_tags:
Name: "{{ name }}"
volumes: "{{ volumes }}"
wait: yes
register: ec2
- name: Add new instances to host group
add_host:
name: "{{ item.public_dns_name }}"
groups: "{{ name }}"
ec2_id: "{{ item.id }}"
with_items: ec2.instances
- name: Wait for instance to boot
wait_for:
host: "{{ item.public_dns_name }}"
port: 22
delay: 30
timeout: 300
state: started
with_items: ec2.instances
The ec2_ami_find module is a new addition to Ansible 2.0 but has not been backported to 1.9, so we'll need to import this module from GitHub and place it into the library/
directory relative to deploy.yml
.
Run the playbook with ansible-playbook deploy.yml -vv
and a new instance will be launched. You'll see it in the AWS Web Console and you should be able to SSH to it.
Step 2: Deploy the application
Now we'll use Ansible to deploy our application and start it. We'll deploy a sample Node.js web application, the source code of which is kept in a public git repository. Ansible is going to clone and checkout our application at a desired revision on the target instance and configure it to start on boot, in addition to setting up a web server.
---
# deploy.yml
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
---
# roles/deploy/tasks/main.yml
- name: Install git
apt:
pkg: git
state: present
sudo: yes
- name: Create www directory
file:
path: /srv/www
owner: ubuntu
group: ubuntu
state: directory
sudo: yes
- name: Clone repository
git:
repo: "https://github.com/atplanet/hello-world-express-app.git"
dest: /srv/www/webapp
version: master
- name: Install upstart script
copy:
src: upstart.conf
dest: /etc/upstart/webapp.conf
sudo: yes
- name: Enable and start the application
service:
name: webapp
enabled: yes
state: restarted
sudo: yes
# roles/deploy/files/upstart.conf
description "Sample Node.js app"
author "Tom Bamford"
start on (local-filesystems and net-device-up)
stop on runlevel [06]
env IP="127.0.0.1"
env NODE_ENV="production"
setuid ubuntu
respawn
exec node /srv/www/webapp/app.js
---
# roles/nginx/tasks/main.yml
- name: Install Nginx
apt:
pkg: nginx
state: present
sudo: yes
- name: Configure Nginx
copy:
src: nginx.conf
dest: /etc/sites-enabled/default
sudo: yes
- name: Enable and start Nginx
service:
name: nginx
enabled: yes
state: restarted
sudo: yes
# roles/nginx/files/nginx.conf
server {
listen 80 default_server;
location / {
proxy_pass http://127.0.0.1:8000;
}
}
Running the playbook again will launch another instance, install some useful packages, deploy our application and set up Nginx as our web server. If you browse to the newest instance at its hostname, as reported in the output of ansible-playbook, you should see a "Hello World" page.
Step 3: Build the AMI
Now that the application is deployed and running, we can use the newly launched instance to build an AMI. Create the build-ami
role and amend the deploy.yml to invoke it.
---
# deploy.yml
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
---
# roles/build-ami/tasks/main.yml
- name: Create AMI
ec2_ami:
region: "{{ region }}"
instance_id: "{{ ec2_id }}"
name: "webapp-{{ ansible_date_time.iso8601 | regex_replace('[^a-zA-Z0-9]', '-') }}"
wait: yes
state: present
register: ami
Step 4: Terminate old instances
You'll probably have noticed by now that each time the playbook is run, Ansible launches a new instance. At this rate, we'll keep accumulating instances that we don't need, so we will add another role and a new task to locate these instances and terminate them. Now, after Ansible successfully launches a new instance, it will terminate any existing instances immediately afterwards.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
- hosts: old-ami-build
roles:
- terminate
---
# roles/terminate/tasks/main.yml
- name: Terminate old instance(s)
ec2:
instance_ids: "{{ ec2_id }}"
region: "{{ region }}"
state: absent
wait: yes
Step 5: Create a Launch Configuration
Our AMI is built, so now we'll want to create a new Launch Configuration to describe the new instances that should be launched from this AMI. We'll create another role to handle that.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
- create-launch-configuration
- hosts: old-ami-build
roles:
- terminate
---
# roles/create-launch-configuration/tasks/main.yml
- name: Create Launch Configuration
ec2_lc:
region: "{{ region }}"
name: "webapp-{{ ansible_date_time.iso8601 | regex_replace('[^a-zA-Z0-9]', '-') }}"
image_id: "{{ ami.image_id }}"
key_name: "{{ keypair }}"
instance_type: "{{ instance_type }}"
security_groups: "{{ security_groups }}"
volumes: "{{ volumes }}"
instance_monitoring: yes
Step 6: Create an Elastic Load Balancer
Clients will connect to an Elastic Load Balancer which will distribute incoming requests among the instances we have launched into our upcoming Auto Scaling Group. Again we'll create another role to handle the management of the ELB, and apply it from our playbook.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
- create-launch-configuration
- load-balancer
- hosts: old-ami-build
roles:
- terminate
---
# roles/load-balancer/tasks/main.yml
- name: Configure Elastic Load Balancers
ec2_elb_lb:
region: "{{ region }}"
name: webapp
state: present
zones: "{{ zone }}"
connection_draining_timeout: 60
listeners:
- protocol: http
load_balancer_port: 80
instance_port: 80
health_check:
ping_protocol: http
ping_port: 80
ping_path: "/"
response_timeout: 10
interval: 30
unhealthy_threshold: 6
healthy_threshold: 2
register: elb_result
Step 7: Create and configure an Auto Scaling Group
We'll create an Auto Scaling Group and configure it to use the Launch Configuration we previously created. Within the boundaries that we define, AWS will launch instances into the ASG dynamically based on the current load across all instances. Equally when the load drops, some instances will be terminated accordingly. Exactly how many instances are launched or terminated is defined in one or more scaling policies, which are also created and linked to the ASG.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
- create-launch-configuration
- load-balancer
- auto-scaling
- hosts: old-ami-build
roles:
- terminate
---
# roles/auto-scaling/tasks/main.yml
- name: Retrieve current Auto Scaling Group properties
command: "aws --region {{ region }} autoscaling describe-auto-scaling-groups --auto-scaling-group-names webapp"
register: asg_properties_result
- name: Set asg_properties variable from JSON output if the Auto Scaling Group already exists
set_fact:
asg_properties: "{{ (asg_properties_result.stdout | from_json).AutoScalingGroups[0] }}"
when: (asg_properties_result.stdout | from_json).AutoScalingGroups | count
- name: Configure Auto Scaling Group and perform rolling deploy
ec2_asg:
region: "{{ region }}"
name: webapp
launch_config_name: webapp
availability_zones: "{{ zone }}"
health_check_type: ELB
health_check_period: 300
desired_capacity: "{{ asg_properties.DesiredCapacity | default(2) }}"
replace_all_instances: yes
replace_batch_size: "{{ (asg_properties.DesiredCapacity | default(2) / 4) | round(0, 'ceil') | int }}"
min_size: 2
max_size: 10
load_balancers:
- webapp
state: present
register: asg_result
- name: Configure Scaling Policies
ec2_scaling_policy:
region: "{{ region }}"
name: "{{ item.name }}"
asg_name: webapp
state: present
adjustment_type: "{{ item.adjustment_type }}"
min_adjustment_step: "{{ item.min_adjustment_step }}"
scaling_adjustment: "{{ item.scaling_adjustment }}"
cooldown: "{{ item.cooldown }}"
with_items:
- name: "Increase Group Size"
adjustment_type: "ChangeInCapacity"
scaling_adjustment: +1
min_adjustment_step: 1
cooldown: 180
- name: "Decrease Group Size"
adjustment_type: "ChangeInCapacity"
scaling_adjustment: -1
min_adjustment_step: 1
cooldown: 300
register: sp_result
- name: Determine Metric Alarm configuration
set_fact:
metric_alarms:
- name: "{{ asg_name }}-ScaleUp"
comparison: ">="
threshold: 50.0
alarm_actions:
- "{{ sp_result.results[0].arn }}"
- name: "{{ asg_name }}-ScaleDown"
comparison: "<="
threshold: 20.0
alarm_actions:
- "{{ sp_result.results[1].arn }}"
- name: Configure Metric Alarms and link to Scaling Policies
ec2_metric_alarm:
region: "{{ region }}"
name: "{{ item.name }}"
state: present
metric: "CPUUtilization"
namespace: "AWS/EC2"
statistic: "Average"
comparison: "{{ item.comparison }}"
threshold: "{{ item.threshold }}"
period: 60
evaluation_periods: 5
unit: "Percent"
dimensions:
AutoScalingGroupName: "{{ asg_name }}"
alarm_actions: "{{ item.alarm_actions }}"
with_items: metric_alarms
when: max_size > 1
register: ma_result
There's more going on here too. We not only configure our ASG and scaling policies, but also create CloudWatch metric alarms to measure the load across our instances, and associate them with the corresponding scaling policies to complete our configuration.
Here we have configured our CloudWatch alarms to trigger based on aggregate CPU usage within our auto scaling group. When the average CPU utilization exceeds 50% across your instances for 5 consecutive samples taken every 60 seconds (i.e. 5 minutes), a scaling event will be triggered that launches a new instance to relieve the load. A corresponding CloudWatch alarm also triggers a scaling event to terminate an instance from the auto scaling group when the average CPU utilization drops below 20% across your instances for the same sample period.
The minimum and maximum sizes for the auto scaling group are set to 2 and 10 respectively. It's important to get these values right for your application workload. You do not want to be under resourced for early peaks in traffic, and for redundancy reasons it's a good idea to always have at least 2 instances in service. Equally you probably want your application to scale for peak periods, but perhaps not beyond a safety limit in the event you receive massive amounts of traffic which could result in escalating costs.
Particularly important to note here is how we configure the ec2_asg
module to perform rolling deploys. First, we determine how many instances the ASG currently has running and use this to specify our desired_capacity
and calculate a suitable replace_batch_size
. The replace_all_instances
option specifies that all currently running instances should be replaced by new instances using the new Launch Configuration. Together, this ensures that the capacity of our ASG is not adversely affected during the deploy and allows us to safely deploy at any time, whether we are currently running 5 or 5000 instances! Of course this means that the more instances you have running, the longer the entire process will take. You may wish to increase the replace_batch_size
if you are consistently running more instances.
Step 8: Update DNS (optional)
If you have a domain name, or subdomain, set up with AWS Route 53, you can have Ansible update the DNS records to point to your Auto Scaling Group.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: localhost
connection: local
gather_facts: no
roles:
- create-ami
- create-launch-configuration
- load-balancer
- auto-scaling
- dns
- hosts: old-ami-build
roles:
- terminate
---
# roles/dns/tasks/main.yml
- name: Update DNS
route53:
command: create
overwrite: yes
zone: "{{ domain }}"
record: "www.{{ domain }}"
type: CNAME
ttl: 300
value: "{{ elb_result.elb.dns_name }}"
Step 9: Cleaning up
Whilst we already configured Ansible to terminate old instances used for building AMIs, right now we will start to accumulate launch configurations and AMIs each time we invoke the deploy.yml
playbook. This might not appear to be much of a problem at the outset (financial costs aside), but it will soon become an issue due to service limits imposed by AWS. At the time of writing, the relevant limit on Launch Configurations was 100 per region. When this limit is reached, no more can be created and our playbook will start to fail.
Note that whilst you can request increased limits per region for your account, in our experience sometimes these requests are refused on the grounds that AWS would prefer for you to clean up your cruft instead of relying on perpetual service limit increases.
Leaving unused resources lying around is not very good practise in any case, and we certainly don't want to be paying for those resources unnecessarily. To fix this, we'll make use of the ec2_ami_find
/ec2_ami
modules to delete the older AMIs, and a quick and dirty (but effective) hand rolled module to discard old launch configurations.
---
# deploy.yml
- name: Find existing instance(s)
hosts: "tag_Name_ami-build"
gather_facts: false
tags: find
tasks:
- name: Add to old-ami-build group
group_by:
key: old-ami-build
- hosts: localhost
connection: local
gather_facts: no
roles:
- role: launch
name: ami-build
- hosts: ami-build
roles:
- deploy
- nginx
- hosts: ami-build
connection: local
gather_facts: no
roles:
- create-ami
- create-launch-configuration
- load-balancer
- auto-scaling
- dns
- hosts: localhost
connection: local
gather_facts: no
roles:
- delete-old-launch-configurations
- delete-old-amis
- hosts: old-ami-build
connection: local
gather_facts: no
roles:
- terminate
---
# roles/delete-old-amis/tasks/main.yml
- ec2_ami_find:
region: "{{ region }}"
owner: self
name: "webapp-*"
sort: name
sort_end: -10
register: old_ami_result
- ec2_ami:
region: "{{ region }}"
image_id: "{{ item.ami_id }}"
delete_snapshot: yes
state: absent
with_items: old_ami_result.results
ignore_errors: yes
---
# roles/delete-old-launch-configurations/tasks/main.yml
- lc_find:
region: "{{ region }}"
name_regex: "webapp-.*"
sort: yes
sort_end: -10
register: old_lc_result
- ec2_lc:
region: "{{ region }}"
name: "{{ item.name }}"
state: absent
with_items: old_lc_result.results
ignore_errors: yes
#!/usr/bin/python
# roles/delete-old-launch-configurations/library/lc_find.py
import json
import subprocess
def main():
argument_spec = ec2_argument_spec()
argument_spec.update(dict(
region = dict(required=True,
aliases = ['aws_region', 'ec2_region']),
name_regex = dict(required=False),
sort = dict(required=False, default=None, type='bool'),
sort_order = dict(required=False, default='ascending',
choices=['ascending', 'descending']),
sort_start = dict(required=False),
sort_end = dict(required=False),
)
)
module = AnsibleModule(
argument_spec=argument_spec,
)
name_regex = module.params.get('name_regex')
sort = module.params.get('sort')
sort_order = module.params.get('sort_order')
sort_start = module.params.get('sort_start')
sort_end = module.params.get('sort_end')
lc_cmd_result = subprocess.check_output(["aws", "autoscaling", "describe-launch-configurations", "--region", module.params.get('region')])
lc_result = json.loads(lc_cmd_result)
results = []
for lc in lc_result['LaunchConfigurations']:
data = {
'arn': lc["LaunchConfigurationARN"],
'name': lc["LaunchConfigurationName"],
}
results.append(data)
if name_regex:
regex = re.compile(name_regex)
results = [result for result in results if regex.match(result['name'])]
if sort:
results.sort(key=lambda e: e['name'], reverse=(sort_order=='descending'))
try:
if sort and sort_start and sort_end:
results = results[int(sort_start):int(sort_end)]
elif sort and sort_start:
results = results[int(sort_start):]
elif sort and sort_end:
results = results[:int(sort_end)]
except TypeError:
module.fail_json(msg="Please supply numeric values for sort_start and/or sort_end")
module.exit_json(results=results)
from ansible.module_utils.basic import *
from ansible.module_utils.ec2 import *
if __name__ == '__main__':
main()
When these roles are used together, Ansible will maintain a history of 10 AMIs and 10 Launch Configurations prior to the latest one of each. This will provide our rollback capability; in the event that you wish to roll back to an earlier deployed version of your application, you can update the active Launch Configuration in your Auto Scaling Group settings and replace your instances by terminating them in batches. Auto Scaling will start up new instances with your specified launch configuration in order to fulfill the desired instance count.
Win!
Now that we have a completed playbook to handle deployments of our application to EC2 Auto Scaling, all that remains is to hook it up to your existing systems to invoke it whenever you want a new deploy to occur. We'll cover that in a later blog post.
All the code from this article is available on GitHub.