Odin Auto-Scaling Group Deployer
Deploy your 12-factor-applications to AWS easily and securely with the Odin. Odin is a AWS Step Function base on the step
framework that deploys services as Auto-Scaling Groups (ASG's) to AWS.
Odin's goals/requirements/features are:
- Ephemeral Blue/Green: create new instances, wait for them to become healthy, delete old instances; treating compute instances as disposable and ephemeral.
- Declarative: describe what a successful release looks like, not how to deploy it.
- Scalable: can scale both vertically (larger instances) and horizontally (more instances).
- Secure: resources are verified to ensure that they cannot be used accidentally or maliciously.
- Gracefully Fail: handle failures to recover and roll back with no/minimal impact to users.
- Configuration Parity: minimize divergence between production, staging and development environments by keeping releases as similar as possible.
- No Configuration: once Odin is deployed it requires no further configuration.
- Multi Account: one deployer for all AWS accounts.
Getting Started
Odin is implemented as an AWS Lambda Function and AWS Step Function that deploys by assuming a role into an AWS account. You can bootstrap these into AWS with:
git pull # pull down new code
./scripts/bootstrap
Testing with deploy-test
Odin includes a test project deploy-test
that has one service web
that starts an nginx server to be mounted behind a Elastic Load Balancer (ELB) and Application Load Balancer target group. The service instances have a security group and instance profile.
To create the AWS resources for deploy-test
:
./scripts/geo apply resources/deploy-test-resources.rb
Note: you will also have to tag the latest Ubuntu release with Name: ubuntu
and DeployWith: odin
A deploy-test
release file deployer-test-release.json
looks like:
{
"project_name": "coinbase/deploy-test",
"config_name": "development",
"subnets": ["test_private_subnet_a", "test_private_subnet_b"],
"ami": "ubuntu",
"services": {
"web": {
"instance_type": "t2.nano",
"security_groups": ["ec2::coinbase/deploy-test::development"],
"elbs": ["coinbase-deploy-test-web-elb"],
"profile": "coinbase-deploy-test",
"target_groups": ["coinbase-deploy-test-web-tg"]
}
}
}
The user data for the release is from the file deployer-test-release.json.userdata
:
#cloud-config
repo_update: true
repo_upgrade: all
packages:
- docker.io
runcmd:
- docker run -d -p 8000:80 nginx
To build a release for deploy-test
and send it to Odin we use the odin
executable:
odin deploy deploy-test-release.json
The odin
executable takes the release file, merges in the user data, attaches some meta-data like created_at
and `release_id, then send the release to the Odin step function that:
- validates the sent release and any referenced resources.
- creates a new auto-scaling group for
web
that starts an nginx server. - waits for the EC2 instances in the
web
ASG to become healthy w.r.t. the ASG, the ELB and the target group. This may take a few minutes. - Once healthy the ASGs from the previous release and terminate their instances.
This is the ephemeral blue/green where old instances are deleted and new servers created.
Odin Release
An Odin release is a request to deploy a Project-Configuration where:
- A Project is a code-base typically named with
org/name
. - A Configuration is the environment the project is being deployed into, e.g.
development
,production
.
Each release can define 1-to-many Services; each service is a logical group of servers, e.g. web
or worker
, that maps to a single auto-scaling group (ASG).
When Odin is sent a release, it moves it through a state machine:
- Validate: validate the release is correct.
- Lock: grabs a lock on project-configuration.
- ValidateResources: validate resources w.r.t. the project, configuration and service using them.
- Deploy: creates an ASG and other resource for each service.
- CheckHealthy: check to see if the new instances created are healthy w.r.t. their ASGs ELBs and target groups. If instances are seen to be terminating immediately halt release.
- CleanUpSuccess: if the release was a success, then delete the old ASGs.
- CleanUpFailure: if the release failed, delete the new ASGs.
- ReleaseLockFailure: try to release the lock and fail.
At each of these states it is possible to fail and then move towards a failure state. The typical failures are:
- BadReleaseError: The release sent was invalid because either its structure was incorrect, its values were invalid, or its resources were invalid.
- LockExistsError: Could not grab the lock because either another deploy for the project-configuration is currently going out, or a previous deploy left a lock in place.
- DeployError: Unable to create a new ASG or resource.
- HaltError: Halt was detected or instances were found terminating.
- TimeoutError: The deploy took too long and failed.
The end states are:
- Success: the release went went as planned.
- FailureClean: release was unsuccessful, but cleanup was successful, so AWS was left in good state.
- FailureDirty: release was unsuccessful, but cleanup failed so AWS was left in a bad state. This should never happen and should alert if this happens, and file a bug.
- It is possible to not end in one of these states if the state machine is incorrect. This is very bad, alert if this happens and file a bug.
Resources
A release uses resources that must exist and be configured correctly to be used for the project-configuration-service being deployed.
A release must have:
- an AMI defined with the
ami
key that can be either aName
tag or AMI ID e.g.ami-1234567
- Subnets defined with
subnets
key that is a list of eitherName
tags or Subnet IDs e.g.subnet-1234567
Both the above resources MUST have a tag DeployWith
that equals odin
.
Services can have:
- Security Groups defined with
security_groups
key is a list of security groupsName
tags - Elastic Load Balancers defined with
elbs
key is a list of ELB names - Application Load Balancer Target Groups defined with
target_groups
is a list of target group'sName
tags
All the above resources MUST be tagged with the ProjectName
, ConfigName
and ServiceName
of the release to ensure that resources are assigned correctly.
Services can also have an Instance Profile defined by the profile
key that is and instance profile Name
tag. The roles path MUST be equal to /<project_name>/<config_name>/<service_name>/
.
Scale
Odin makes it easy to scale both vertically and horizontally. To scale deploy-test
we add to the release:
{ ...
"services": {
"web": { ...
"instance_type": "c4.xlarge",
"ebs_volume_size": 20,
"ebs_volume_type": "gp2",
"ebs_device_name": "/dev/sda1",
"autoscaling": {
"min_size": 3,
"max_size": 5,
"spread": 0.2,
"max_terms": 1,
"policies": [
{
"type": "cpu_scale_up",
"threshold" : 25,
"scaling_adjustment": 2
},
{
"type": "cpu_scale_down",
"threshold" : 15,
"scaling_adjustment": -1
}
]
}
}
}
}
instance_type
is the EC2 instance type for the serviceebs_volume_size
,ebs_volume_type
,ebs_device_name
define the attached EBS volume in GB.
The autoscaling
key defines the horizontal scaling of a service:
- all calculations are bounded by
min_size
andmax_size
. - the
desired_capacity
is equal to themin_size
or capacity of the previously launched service - the actual number of instances launched is the
desired_capacity * (1 + spread)
- to be deemed the healthy the service must have
desired_capacity * (1 - spread)
- if the number of terminating is greater than or equal to
max_terms
(default0
), the release is immediately halts. policies
are defined above to increase thedesired_capacity
by 2 instances if the CPU goes above 25% and reduce by 1 instance if it drops below 15%.
Both spread
and max_terms
are useful when launching many instances because as scale increases the number of cloud errors increase.
User Data
Do not put sensitive data into user data. User data is easily accessible from the AWS console, difficult to secure with IAM, and very limited in size. Odin requires user data passed to it to be KMS encrypted, uploaded to S3, and a SHA256 be passed in the release to be checked. The userdata will still be accessible in plain text on a launch configuration and EC2 instances, so these precautions are more to protect tampering than secrets.
For any secret an instance needs access to, we recommend using Vault, AWS Parameter store, or KMS encrypted S3 authenticated by a service's instance profile.
The user data is KMS encrypted and uploaded to S3. Odin will replace some strings with information about the release, project, config and service, e.g.:
...
write_files:
- path: /
content: |
{{RELEASE_ID}}
{{PROJECT_NAME}}
{{CONFIG_NAME}}
{{SERVICE_NAME}}
Odin will replace {{PROJECT_NAME}}
with the name of the project and {{SERVICE_NAME}}
with the name of the service. This can be useful for getting service specific configuration and logging.
The odin
client will upload the user data for the services from the <release_file>.userdata
file, e.g. deployer-test-release.json.userdata
.
Timeout
A release can have a timeout
which is how long in seconds a release will wait for its services to become healthy. By default the timeout is 10 minutes, the max value would be around a year (31556926 seconds) since that is how long a step function can run.
Lifecycle
AWS provides Auto Scaling Group Lifecycle Hooks to detect and react to auto-scaling events. You can add the lifecycle hooks to the ASGs with:
{ ...
"lifecycle": {
"termhook" : {
"transition": "autoscaling:EC2_INSTANCE_TERMINATING",
"role": "asg_lifecycle_hooks",
"sns": "asg_lifecycle_hooks",
"heartbeat_timeout": 300
}
}
}
These can be used to gracefully shutdown instances, which is necessary if a service has long running jobs e.g. a worker
service.
Halt
Odin supports manually stopping a release while is it being deployed. Just execute:
odin halt deploy-test-release.json
This will:
- Find the running deploy for the project configuration
- Write a
halt
file to S3 - Wait for Odin to detect the halt file and fail the deploy
Halt does not guarantee that the release will not be deployed, if executed too late the release may still result in success.
DO NOT use Stop execution
of the Odin step function as it will not clean up resources and leave AWS in a bad state.
Security
Deployers are critical pieces of infrastructure as they may be used to compromise software they deploy. As such, we take security very seriously around the odin
and try to answer the following questions:
- Authentication: Who can deploy?
- Authorization: What can be deployed?
- Replay and Man-in-the-middle (MITM): Can some unauthorized person edit or reuse a release to change what is deployed?
- Audit: Who has done what, and when?
Authentication
The central authentication mechanisms are the AWS IAM permissions for step functions and S3.
By limiting the ec2:CreateAutoscalingGroup
, permissions the Odin function becomes the only way to deploy ASG's. Then limiting permissions to who can call states:StartExecution
for Odin limits who can deploy.
Ensuring that Odin's lambda can only access a single S3 bucket, further limits who can deploy with:
{
"Effect": "Allow",
"Action": [
"s3:GetObject*", "s3:PutObject*",
"s3:List*", "s3:DeleteObject*"
],
"Resource": [
"arn:aws:s3:::#{s3_bucket_name}/*",
"arn:aws:s3:::#{s3_bucket_name}"
]
},
{
"Effect": "Deny",
"Action": ["s3:*"],
"NotResource": [
"arn:aws:s3:::#{s3_bucket_name}/*",
"arn:aws:s3:::#{s3_bucket_name}"
]
},
The Odin step function also needs to decrypt the KMS encrypted user-data that is uploaded to S3. By default it is encrypted with the alias/aws/s3
key, but a custom KMS key can be used and wither an alias or arn can be added to user_data_kms_key
. A custom key will give a better audit trail, and can lock down who can release even more.
Who can execute the step function, and who can upload to S3 are the two permissions that guard who can deploy.
Authorization
All resources that can be used in a Odin deploy must opt-in using tags or paths. Additionally, service resources require specific tags or paths denoting which project/config/service can use them.
Assets uploaded to S3 are in the path /<ProjectName>/<ConfigName>
so limiting who can s3:PutObject
to a path can be used to limit what project-configs they can deploy or halt.
Replay and MITM
Each release the client generates a release release_id
, a created_at
date, and together also uploads the release to S3.
The odin
will reject any request where the created_at
date is not recent, or the release sent to the step function and S3 don't match. This means that if a user can invoke the step function, but not upload to S3 (or vice-versa) it is not possible to deploy old or malicious code.
Audit
Working out what happened and when is very useful for debugging and security response. Step functions make it easy to see the history of all executions in the AWS console and via API. S3 can log all access to cloud-trail, so collecting from these two sources will show all information about a deploy.
Continuing Deployment
There is always more to do:
- Allow LifeCycle Hooks to send to Cloudwatch.
- Subnet, AMI, life cycle and userdata overrides per service.
- Check EC2 instance limits and capacity before deploying.
- Slowly scale (Canary) instances up rather than all at once, e.g. deploy 1 instance check it is healthy then deploy the rest.
- Add ELB and Target Group error rates when checking healthy.
- Custom auto-scaling policy types.