  • Stars: 172
  • Rank: 215,341 (Top 5%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 7 years ago
  • Updated: 7 months ago

Repository Details

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

Airflow Pipeline Docker Image Set-up

CI Status

This repo is a GitHub Actions build matrix set-up that generates Docker images of Airflow, bundled with the following major applications:

  • Airflow
  • Spark
  • Hadoop integration with Spark
  • Python
  • SQLAlchemy

Additionally, Poetry is used to perform all Python-related installations in a predefined global project directory, so that new packages can be added without conflicting dependency versions, which raw pip cannot guarantee. See https://github.com/dsaidgovsg/spark-k8s-addons#how-to-properly-manage-pip-packages for more information.

For builds involving Airflow v2 onwards, note that Poetry is not officially supported as an installation tool; it is used here anyway to ensure that dependencies are compatible and tested to work across multiple builds with different versions.

See apache/airflow#13149 for a related discussion, including how to resolve possible conflicts when installing packages on top of this base image.
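
As a minimal sketch of that workflow (the image name and tag below are placeholders, and it is assumed the container starts in the predefined global Poetry project directory), you can check how a new package would resolve against the preinstalled set before baking it into a derived image:

# Image name/tag are assumptions; --dry-run resolves dependencies without installing
docker run --rm -it dsaidgovsg/airflow-pipeline:<tag> \
  poetry add --dry-run apache-airflow-providers-http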

Entrypoint

For convenience, the default entrypoint currently runs both the webserver and the scheduler in the same instance, with the webserver in the background and the scheduler in the foreground. The convenience environment variables only take effect when the entrypoint is used without any extra command.

If you prefer to run the various Airflow services separately, simply pass the full command to the Docker container; however, doing so bypasses all of the convenience environment variables and functionality.
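
For example, to run only the scheduler (the image name and tag here are placeholders for an actual published image):

# Passing a full command skips the entrypoint conveniences
docker run --rm dsaidgovsg/airflow-pipeline:<tag> airflow scheduler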

These convenience functionalities include:

  1. Checking whether the database (SQLite or Postgres) is ready
  2. Automatically running airflow db init and airflow db upgrade
  3. Easy creation of an Airflow Web UI admin user via simple env vars

See entrypoint.sh for more details and the list of convenient environment variables.
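
For Airflow v2, the automated steps are roughly equivalent to the following CLI calls (the credential values are illustrative; the actual env var names are listed in entrypoint.sh):

# Initialize and migrate the Airflow metadata database
airflow db init
airflow db upgrade
# Create an RBAC admin user for the Web UI
airflow users create \
  --role Admin --username admin --password Password123 \
  --firstname Admin --lastname User --email admin@example.com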

Also note that the command runs as the airflow user/group, unless the host overrides the user/group when running the Docker container.
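
For example, to run as the host's user/group instead (image name and tag are again placeholders):

# --user overrides the container's default airflow user/group
docker run --rm --user "$(id -u):$(id -g)" \
  dsaidgovsg/airflow-pipeline:<tag> airflow version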

Running locally

You will need the docker and docker-compose commands installed.

Default Combined Airflow Webserver and Scheduler

docker-compose up --build

Navigate to http://localhost:8080/, and log in using the following RBAC credentials to try out the DAGs:

  • Username: admin
  • Password: Password123

Note that the webserver logs are suppressed by default.

Press CTRL-C to gracefully terminate the services.

Separate Airflow Webserver and Scheduler

docker-compose -f docker-compose.split.yml up --build

Navigate to http://localhost:8080/ to try out the DAGs.

Both webserver and scheduler logs are shown separately.

Press CTRL-C to gracefully terminate the services.

Versioning

Starting from Docker tags carrying self-version v1, any breaking change to how the Docker image is used will generate a new self-version, to minimize the impact on users tracking the most up-to-date image.

These are considered breaking changes:

  • Change of Linux distro, e.g. Alpine <-> Debian. This also changes the package management tool used, e.g. apk vs apt. However, this does not include Linux distro version upgrades that may affect package management, e.g. alpine:3.9 vs alpine:3.10.
  • Removal of advertised installed CLI tools that are not listed within the Docker tag. E.g. Spark and Hadoop are part of the Docker tag, so they are not part of the advertised CLI tools.
  • Removal of advertised environment variables
  • Change of any environment variable value

When a CLI tool is known to undergo a major version upgrade, this set-up will also try to release a new self-version number. Note, however, that this is best-effort only, because most of the tools are inherited from upstream, or it is simply impossible or undesirable to pin the version to install.

Airflow provider packages

Airflow provider packages have been removed from the image from version v8 onwards, so users have to install them manually instead. Note that provider packages follow their own versioning, independent of Airflow's.

See https://airflow.apache.org/docs/apache-airflow/2.1.0/backport-providers.html#backport-providers for more details.

# Airflow V2
poetry add apache-airflow-providers-apache-spark==1.0.3

# Airflow V1
poetry add apache-airflow[spark]==1.10.z

Changelogs

All self-versioned change logs are listed in CHANGELOG.md.

The advertised CLI tools and env vars are also listed in the detailed change logs.

How to Manually Build Docker Image

Example build command:

AIRFLOW_VERSION=2.3
SPARK_VERSION=3.3.0
HADOOP_VERSION=3.3.2
SCALA_VERSION=2.12
JAVA_VERSION=11
PYTHON_VERSION=3.9
SQLALCHEMY_VERSION=1.4
docker build -t airflow-pipeline \
  --build-arg "AIRFLOW_VERSION=${AIRFLOW_VERSION}" \
  --build-arg "SPARK_VERSION=${SPARK_VERSION}" \
  --build-arg "HADOOP_VERSION=${HADOOP_VERSION}" \
  --build-arg "SCALA_VERSION=${SCALA_VERSION}" \
  --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
  --build-arg "JAVA_VERSION=${JAVA_VERSION}" \
  --build-arg "SQLALCHEMY_VERSION=${SQLALCHEMY_VERSION}" \
  .

You may refer to vars.yml to get a sense of all the possible build arguments that can be combined.
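
After the build completes, a quick smoke test of the freshly built image could be:

# Should print the Airflow version baked into the image
docker run --rm airflow-pipeline airflow version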

Caveat

Because this image is based on the Spark with Kubernetes compatible image, which always generates Debian-based Docker images, the images generated from this repository are likely to stay Debian-based as well. There is no guarantee that this will always hold, but any such change will be marked with a Docker image release tag.

Also, the default entrypoint (when run without an extra command) currently assumes that a Postgres server will be used (the default SQLite can work as an alternative). When used in this mode, an external Postgres server has to be made available for the Airflow services to access.
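
As a sketch for local experimentation (the database name and credentials below are arbitrary, and how Airflow is pointed at this server depends on the env vars documented in entrypoint.sh), a throwaway Postgres server could be started with the official postgres image:

# Disposable Postgres instance reachable on localhost:5432
docker run -d --name airflow-postgres \
  -e POSTGRES_USER=airflow \
  -e POSTGRES_PASSWORD=airflow \
  -e POSTGRES_DB=airflow \
  -p 5432:5432 \
  postgres:13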

More Repositories

  1. terraform-modules: Reusable Terraform modules (HCL, 78 stars)
  2. multimodal-learning-hands-on-tutorial (Jupyter Notebook, 61 stars)
  3. k-shortest-path: Implements K shortest path algorithms for networkx (Python, 16 stars)
  4. vigilantgantry: Face segmentation algorithm that VigilantGantry uses to identify potential misses by current thermal systems, due to occlusion from fringes, caps, or head-gear (Python, 16 stars)
  5. python-spark: Docker image for a Python installation with Spark, Hadoop, and Sqoop binaries (15 stars)
  6. TDBSCAN: TDBSCAN with Move Ability: Spatiotemporal Density Clustering (Python, 9 stars)
  7. spark-geo-privacy: Geospatial privacy functions for Apache Spark (Scala, 6 stars)
  8. yarn-reverse-proxy: Reverse proxy for the status pages of a YARN cluster (Shell, 5 stars)
  9. folium-resource-server: Simple server to host folium JS and CSS resources (JavaScript, 5 stars)
  10. nomad-parametric-autoscaler: A customizable Nomad/EC2 auto-scaling service (JavaScript, 5 stars)
  11. stack-2022-differential-privacy-workshop (Jupyter Notebook, 4 stars)
  12. spark-k8s-addons: Dockerfile set-up to install cloud-related utilities onto the standard Spark K8s Docker images (Dockerfile, 3 stars)
  13. registrywatcher (Go, 2 stars)
  14. zeppelin: Zeppelin Dockerfile set-up with a wrapping dynamic GitHub releases JAR loader (Dockerfile, 2 stars)
  15. spark-custom-addons: Dockerfile set-up to add dependencies into `spark-custom` images (Dockerfile, 2 stars)
  16. spark-k8s: CI set-up to generate Spark with Kubernetes Docker images (Shell, 2 stars)
  17. sg-tileserver-gl: Repackaged repository to build a Docker image for Singapore tiles only (Shell, 2 stars)
  18. python-node (2 stars)
  19. spark-base: Dockerfile set-up for Spark, imbued with varying degrees of Python data science packages (Shell, 2 stars)
  20. spark-jupyterhub: Experimental and opinionated set-up to conveniently get JupyterHub going (Shell, 1 star)
  21. zeppelin-jar-loader: Provides a simple JAR loader for dynamic loading of JAR files at start-up for Zeppelin (Scala, 1 star)
  22. spark-custom: Dockerfile set-up for building Spark releases from source code (Dockerfile, 1 star)
  23. data-privacy-workshop (Jupyter Notebook, 1 star)
  24. ljumphost: Proper jumphost set-up and instructions repository (Shell, 1 star)
  25. benchmarking-differential-privacy-tools (Jupyter Notebook, 1 star)
  26. spark-k8s-addons-ds (Dockerfile, 1 star)
  27. kms-reencrypt: Python boto3 script to recursively KMS re-encrypt objects (Python, 1 star)
  28. datavis-examples: Repo with a sample dataset and code to show some simple visualizations (Jupyter Notebook, 1 star)
  29. ilytics: gunicorn, flask, and darknet model (C, 1 star)
  30. folium-override-server: Webserver to re-generate uploaded folium maps to use custom URLs (Python, 1 star)
  31. ura-subzones: Simple library for dealing with URA subzones (JavaScript, 1 star)