• Stars
    star
    649
  • Rank 69,373 (Top 2 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 8 years ago
  • Updated over 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A docker image and kubernetes config files to run Airflow on Kubernetes

kube-airflow (Celery Executor)

Docker Hub Docker Pulls Docker Stars

kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster. This is useful when you'd want:

  • Easy high availability of the Airflow scheduler
  • Easy parallelism of task executions
    • The common way to scale out workers in Airflow is to utilize Celery. However, managing a H/A backend database and Celery workers just for parallelising task executions sounds like a hassle. This is where Kubernetes comes into play, again. If you already had a K8S cluster, just let K8S manage them for you.
    • If you have ever considered to avoid Celery for task parallelism, yes, K8S can still help you for a while. Just keep using LocalExecutor instead of CeleryExecutor and delegate actual tasks to Kubernetes by calling e.g. kubectl run --restart=Never ... from your tasks. It will work until the concurrent kubectl run executions(up to the concurrency implied by scheduler's max_threads and LocalExecutor's parallelism. See this SO question for gotchas) consumes all the resources a single airflow-scheduler pod provides, which will be after the pretty long time.

This repository contains:

  • Dockerfile(.template) of airflow for Docker images published to the public Docker Hub Registry.
  • airflow.all.yaml for manual creating Kubernetes services and deployments to run Airflow on Kubernetes
  • Helm Chart in ./airflow for deployments using Helm

Informations

Manual Installation

Create all the deployments and services to run Airflow on Kubernetes:

kubectl create -f airflow.all.yaml

It will create deployments for:

  • postgres
  • rabbitmq
  • airflow-webserver
  • airflow-scheduler
  • airflow-flower
  • airflow-worker

and services for:

  • postgres
  • rabbitmq
  • airflow-webserver
  • airflow-flower

Helm Deployment (recommended)

Ensure your helm installation is done, you may need to have TILLER_NAMESPACE set as environment variable.

Deploy to Kubernetes using:

make helm-install NAMESPACE=yournamespace HELM_VALUES=/path/to/you/values.yaml

Helm ingresses

The Chart provides ingress configuration to allow customization the installation by adapting the config.yaml depending on your setup.

Prefix

This Helm chart allows using a "prefix" string that will be added to every Kubernetes names. That allows instantiating several, independent Airflow clusters in the same namespace.

Note:

Do NOT use characters such as " (double quote), ' (simple quote), / (slash) or \ (backslash)
in your passwords and prefix and keep it as small as possible.

DAGs deployment: embedded DAGs or git-sync

This chart provide basically two way of deploying DAGs in your Airflow installation:

  • embedded DAGs
  • Git-Sync

This helm chart provide support for Persistant Storage but not for sidecar git-sync pod. If you are willing to contribute, do not hesitate to do a Pull Request !

Using embedded Git-Sync

Git-sync is the easiest way to automatically update your DAGs. It simply checks periodically (by default every minute) a Git project on a given branch and check this new version out when available. Scheduler and worker see changes almost real-time. There is no need to other tool and complex rolling-update procedure.

While it is extremely cool to see its DAG appears on Airflow 60s after merge on this project, you should be aware of some limitations Airflow has with dynamic DAG updates:

If the scheduler reloads a dag in the middle of a dagrun then the dagrun will actually start
using the new version of the dag in the middle of execution.

This is a known issue with airflow and it means it's unsafe in general to use a git-sync like solution with airflow without:

  • using explicit locking, ie never pull down a new dag if a dagrun is in progress
  • make dags immutable, never modify your dag always make a new one

Also keep in mind using git-sync may not be scalable at all in production if you have lot of DAGs. The best way to deploy you DAG is to build a new docker image containing all the DAG and their dependencies. To do so, fork this project

Airflow.cfg as ConfigMap

By default, we use the configuration file airflow.cfg hardcoded in the docker image. This file uses a custom templating system to apply some environmnet variable and feed the airflow processes with (basically it is just some sed).

If you want to use your own airflow.cfg file without having to rebuild a complete docker image, for example when testing new settings, there is a way to define this file in a Kubernetes configuration map:

  • you need to define your own Value file you will feed to helm with helm install -f myvalue.yaml
  • you need to enable init the node airflow.airflow_cfg.enable: true
  • you need to store the content of your airflow.cfg in the node airflow.airflow_cfg.data You can see at airflow/myvalue-with-airflowcfg-configmap.yaml for an example on how to set it in your config.yaml file
  • note it is important to keep the custom templating in your airflow.cfg (ex: {{ POSTGRES_CREDS }}) or at least keep it aligned with the configuration applyied in your Kubernetes Cluster.

Worker Statefulset

As you can see, Celery workers uses StatefulSet instead of deployment. It is used to freeze their DNS using a Kubernetes Headless Service, and allow the webserver to requests the logs from each workers individually. This requires to expose a port (8793) and ensure the pod DNS is accessible to the web server pod, which is why StatefulSet is for.

Embedded DAGs

If you want more control on the way you deploy your DAGs, you can use embedded DAGs, where DAGs are burned inside the Docker container deployed as Scheduler and Workers.

Be aware this requirement more heavy tooling than using git-sync, especially if you use CI/CD:

  • your CI/CD should be able to build a new docker image each time your DAGs are updated.
  • your CI/CD should be able to control the deployment of this new image in your kubernetes cluster

Example of procedure:

  • Fork this project
  • Place your DAG inside the dags folder of this project, update requirements-dags.txt to install new dependencies if needed (see bellow)
  • Add build script connected to your CI that will build the new docker image
  • Deploy on your Kubernetes cluster

You can avoid forking this project by:

  • keep a git-project dedicated to storing only your DAGs + dedicated requirements.txt

  • you can gate any change to DAGs in your CI (unittest, pip install -r requirements-dags.txt,.. )

  • have your CI/CD makes a new docker image after each successful merge using

    DAG_PATH=$PWD
    cd /path/to/kube-aiflow
    make ENBEDDED_DAGS_LOCATION=$DAG_PATH
    
  • trigger the deployment on this new image on your Kubernetes infrastructure

Python dependencies

If you want to add specific python dependencies to use in your DAGs, you simply declare them inside the requirements/dags.txt file. They will be automatically installed inside the container during build, so you can directly use these library in your DAGs.

To use another file, call:

make REQUIREMENTS_TXT_LOCATION=/path/to/you/dags/requirements.txt

Please note this requires you set up the same tooling environment in your CI/CD that when using Embedded DAGs.

Helm configuration customization

Helm allow to overload the configuration to adapt to your environment. You probably want to specify your own ingress configuration for instance.

Build Docker image

git clone this repository and then just run:

make build

Run with minikube

You can browse the Airflow dashboard via running:

minikube start
make browse-web

the Flower dashboard via running:

make browse-flower

If you want to use Ad hoc query, make sure you've configured connections: Go to Admin -> Connections and Edit "mysql_default" set this values (equivalent to values in config/airflow.cfg) :

  • Host : mysql
  • Schema : airflow
  • Login : airflow
  • Password : airflow

Check Airflow Documentation

Run the test "tutorial"

    kubectl exec web-<id> --namespace airflow-dev airflow backfill tutorial -s 2015-05-01 -e 2015-06-01

Scale the number of workers

For now, update the value for the replicas field of the deployment you want to scale and then:

    make apply

Wanna help?

Fork, improve and PR. ;-)

More Repositories

1

aws-secret-operator

A Kubernetes operator that automatically creates and updates Kubernetes secrets according to what are stored in AWS Secrets Manager.
Go
309
star
2

variant

Wrap up your bash scripts into a modern CLI today. Graduate to a full-blown golang app tomorrow.
Go
301
star
3

terraform-provider-eksctl

Manage AWS EKS clusters using Terraform and eksctl
Go
232
star
4

helm-x

Treat any Kustomization or K8s manifests directory as a Helm chart
Go
169
star
5

play2-memcached

A memcached plugin for Play 2.x
Scala
162
star
6

variant2

Turn your bash scripts into a modern, single-executable CLI app today
Go
134
star
7

terraform-provider-helmfile

Deploy Helmfile releases from Terraform
Go
115
star
8

kube-ssm-agent

Secure, access-controlled, and audited terminal sessions to EKS nodes without SSH
Dockerfile
106
star
9

kube-spot-termination-notice-handler

Moved to https://github.com/kube-aws/kube-spot-termination-notice-handler
Shell
97
star
10

concourse-aws

Auto-scaling Concourse CI v2.2.1 on AWS with Terraform (since 2016.04.14)
Go
72
star
11

play2-typescript

TypeScript assets handling for Play 2.0. Compiles .ts files under the /assets dir along with the project.
Scala
72
star
12

crossover

Minimal sufficient Envoy xDS for Kubernetes that knows https://smi-spec.io/
Go
71
star
13

helmfile-operator

Kubernetes operator that continuously syncs any set of Chart/Kustomize/Manifest fetched from S3/Git/GCS to your cluster
Go
62
star
14

okra

Hot-swap Kubernetes clusters while keeping your service up and running.
Go
48
star
15

decouple-apps-and-eks-clusters-with-tf-and-gitops

Smarty
31
star
16

kodedeploy

CodeDeploy for EKS. Parallel and Continuous (re)delivery to multiple EKS clusters. Blue-green deployment "of" EKS clusters.
Shell
29
star
17

config-registry

Switch between kubeconfigs and avoid unintentional operation on your production clusters.
Go
28
star
18

gosh

(Re)write your bash script in Go and make it testable too
Go
24
star
19

tiny-mmo

A work-in-progress example of server-side implementation of a tiny massive multi-player online game, implemented with Scala 2.10.2 and Akka 2.2
Java
23
star
20

geohash-scala

Geohash library for scala
Scala
23
star
21

prometheus-process-exporter

Helml chart for Prometheus process-exporter
Mustache
22
star
22

waypoint-plugin-helmfile

Helmfile deployment plugin for HashiCorp Waypoint
Go
15
star
23

kube-magic-ip-address

daemonset that assigns a magic IP address to pods. Useful for implementing node-local services on Kubernetes. Use in combination with dd-agent, dd-zipkin-proxy, kube2iam, kiam, and Elastic's apm-server
Shell
15
star
24

kube-fluentd

An ubuntu-slim/s6-overlay/confd based docker image for running fluentd in Kubernetes pods/daemonsets
Makefile
15
star
25

node-detacher

Practically and gracefully stop your K8s node on (termination|scale down|maintenance)
Go
13
star
26

kube-node-init

Kubernetes daemonset for node initial configuration. Currently for modifying files and systemd services on eksctl nodes without changing userdata
Smarty
13
star
27

lambda_bot

JavaScript
12
star
28

argocd-clusterset

A command-line tool and Kubernetes controller to sync EKS clusters into ArgoCD cluster secrets
Go
12
star
29

dcind

Docker Compose (and Daemon) in Docker
Shell
12
star
30

flux-repo

Enterprise-grade secrets management for GitOps
Go
10
star
31

brigade-helmfile-demo

Demo for building an enhanced GitOps pipeline with Flux, Brigade and Helmfile
JavaScript
9
star
32

kubeherd

A opinionated toolkit to automate herding your cattles = ephemeral Kubernetes clusters
Shell
9
star
33

sopsed

Spawning and storage of secure environments powered by sops, inspired from vaulted. Out-of-box support for kubectl, kube-aws, helm, helmfile
Go
9
star
34

helmfile-gitops

8
star
35

nodelocaldns

Temporary location of the nodelocaldns chart until it is upstreamed to helm/charts
Smarty
8
star
36

kube-logrotate

An ubuntu-slim/s6-overlay/confd based docker image for running logrotate via Kubernetes daemonsets
Makefile
7
star
37

idea-play

A plugin for IntelliJ IDEA which supports development of Play framework applications.
Java
7
star
38

actions-runner-controller-ci

Test repo for https://github.com/summerwind/actions-runner-controller
Go
6
star
39

kargo

Deploy your Kubernetes application directly or indirectly
Go
6
star
40

shoal

Declarative, Go-embeddable, and cross-platform package manager powered by https://gofi.sh/
Go
6
star
41

ingress-daemonset-controller

Build your own highly available and efficient external ingress loadbalancer on Kubernetes
Go
6
star
42

conflint

Unified lint runners for various configuration files
Go
5
star
43

fluent-plugin-datadog-log

Sends logs to Datadog Log Management: A port/translation of https://github.com/DataDog/datadog-log-agent to a Fluentd plugin
Ruby
5
star
44

play20-zentasks-tests

Added spec2 tests to Play20's 'zentasks
Scala
4
star
45

nomidroid

Android app to find nearby restaurants, bars, etc with Recruit Web API
Java
4
star
46

versapack

Versatile package manager. Version and host anything in Helm charts repositories
4
star
47

vote4music-play20

A Play 2.0 RC1 port of https://github.com/loicdescotte/vote4music
Scala
4
star
48

homebrew-anyenv

Ruby
4
star
49

terraform-provider-kubectl

Run kubectl against multiple K8s clusters in a single terraform-apply. AWS auth/AssumeRole supported.
Go
3
star
50

kube-node-index-prioritizing-scheduler

Kubernetes scheduler extender that uses sorted nodes' positions as priorities. Use in combination with resource request/limit to implement low-latency highly available front proxy cluster
Go
3
star
51

depot

Ruby
2
star
52

kokodoko

JavaScript
2
star
53

brigade-cd

Go
2
star
54

scripts

my scripts for linux machines
Shell
2
star
55

courier

Go
2
star
56

nodejs-chat

JavaScript
2
star
57

ooina

JavaScript
2
star
58

play-websocket-growl-sample

A sample application notifies the server on Growl, and other users via WebSocket
Java
2
star
59

heartbeat-operator

Monitor any http and tcp services via Kubernetes custom resources
Shell
2
star
60

brigade-automation

TypeScript
2
star
61

nbxmpfilter

xmpfilter for NetBeans
2
star
62

play-scala-lobby-example

Java/Scala mixed app with a WebSocketController in Java, while others are in Scala.
JavaScript
2
star
63

wy

Go
2
star
64

docker-squid-gem-proxy

Dockerfile for building docker image to run Squid-based reverse proxies to rubygems.org, for speeding-up `gem install` or `bundle install`
Shell
2
star
65

hcl2-yaml

YAML syntax for HashiCorp Configuration Language Version 2
Go
2
star
66

tweet_alarm_clock_android

Java
2
star
67

testkit

A toolkit for writing Go tests for your cloud apps, infra-as-code, and gitops configs
Go
1
star
68

elasticdrone

A set of tools to configure auto-scaled Drone.io cluster on AWS in ease
JavaScript
1
star
69

kubeaws-cicd-pipeline-golang

An example of the opinionated deployment pipeline for your Kubernetes-on-AWS-native apps. Secure, declarative, customizable, open app deployments at your hand.
Shell
1
star
70

akka-parallel-tasks

Scala
1
star
71

sing

1
star
72

good-code-is-red-example-for-idea-scala-plugin

Scala
1
star
73

share-snippets

JavaScript
1
star
74

scala-scalable-programming-sidenote

Scala
1
star
75

teleport-on-minikube

Shell
1
star
76

genrun

Go
1
star
77

nomidroid_test

tests for nomidroid the android app.
Java
1
star
78

web_smartphonizer

Scala
1
star
79

idobata4s

The Scala client library for idobata.io API
Scala
1
star
80

specs2-sandbox

Scala
1
star
81

kube-distributed-test-runner

Go
1
star
82

delimited-continuation

An example usage of Delimited Continuation in Scala,;implementing an async HTTP client.
Scala
1
star
83

play2-unit-tests-sample

1
star
84

syaml

Set YAML values with `cat your.yaml | syaml a.b.c newValue`
Go
1
star
85

scala-debian-package

1
star
86

glabel

A GOverlay implementation for putting text labels on GMap2
JavaScript
1
star
87

gitops-demo

Makefile
1
star
88

division

Control-plane of your microservices and Kubernetes clusters
Go
1
star
89

sam-containerized-go-example

Go
1
star
90

mumoauth

Yet another implementation of OAuth aiming to cover both OAuth 1.0a and 2, from Client to Server.
Scala
1
star
91

javascript-missiles-and-lasers

JavaScript
1
star
92

kube-veneur

Sidecar container to run local veneur and Kubernetes service for global https://github.com/stripe/veneur. Uses confd and s6-overlay under the hood.
Makefile
1
star
93

tweet_alarm_clock_android_test

Java
1
star
94

thread_local

An implementation of the thread-local variable for Ruby
Ruby
1
star
95

play-multijpa

a Play! plugin to switch databases for each play.db.jpa.Model subclass
Java
1
star
96

argocd-example-apps

Jsonnet
1
star
97

javascript-hue_circle

Draw a hue circle in JavaScript
JavaScript
1
star
98

logrus-bunyan-formatter

Logrus Bunyan Log Formatter
Go
1
star
99

crdb

Custom Resource DB - Kubernetes custom resources without Kubernetes or self-hosted databases. For any cloud, by leveraging cloudprovider-specific managed databases like AWS DynamoDB
Go
1
star