  • Stars: 158
  • Rank: 235,750 (Top 5%)
  • Language: Python
  • License: Apache License 2.0
  • Created: almost 4 years ago
  • Updated: about 1 month ago

Repository Details

Great Expectations Airflow operator

Apache Airflow Provider for Great Expectations

A set of Airflow operators for Great Expectations, a Python library for testing and validating data.

Version Warning:

Due to apply_default decorator removal, this version of the provider requires Airflow 2.1.0+. If your Airflow version is < 2.1.0, and you want to install this provider version, first upgrade Airflow to at least version 2.1.0. Otherwise, your Airflow package version will be upgraded automatically, and you will have to manually run airflow upgrade db to complete the migration.

Notes on compatibility

  • This operator currently works with the Great Expectations V3 Batch Request API only. If you would like to use the operator in conjunction with the V2 Batch Kwargs API, you must use a version below 0.1.0
  • This operator uses Great Expectations Checkpoints instead of the former ValidationOperators.
  • Because of the above, this operator requires Great Expectations >=v0.13.9, which is pinned in the requirements.txt starting with release 0.0.5.
  • Great Expectations version 0.13.8 contained a bug that prevented this operator from working.
  • Great Expectations version 0.13.7 and below will work with version 0.0.4 of this operator and below.

This package has been most recently unit tested with apache-airflow=2.4.3 and great-expectations=0.15.34.

Formerly, there was a separate operator for BigQuery, to facilitate the use of GCP stores. This functionality is now baked into the core Great Expectations library, so the generic Operator will work with any back-end and SQL dialect for which you have a working Data Context and Datasources.

Installation

Prerequisites: an environment running great-expectations and apache-airflow. These are requirements of this package and will be installed as dependencies.

pip install airflow-provider-great-expectations

Depending on your use case, you might need to add ENV AIRFLOW__CORE__ENABLE_XCOM_PICKLING=true to your Dockerfile to enable XCom to pass data between tasks.
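
For illustration only, a downstream task inside the same DAG as the Great Expectations task could read that XCom value roughly as sketched here. It is assumed (not confirmed by this README) that the operator pushes its validation result as its return value, and validate_data is a hypothetical upstream task id:

from airflow.decorators import task

@task
def report_validation_result(ti=None):
    # Pull whatever the upstream Great Expectations task pushed to XCom
    # (assumed to be the validation result object, which is why XCom
    # pickling may need to be enabled).
    result = ti.xcom_pull(task_ids="validate_data")  # hypothetical upstream task id
    print(result)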

Usage

The operator requires a DataContext to run, which can be specified either as:

  1. A path to a directory in which a yaml-based DataContext configuration is located
  2. A Great Expectations DataContextConfig object

Additionally, a Checkpoint may be supplied, which can be specified either as:

  1. The name of a Checkpoint already located in the Checkpoint Store of the specified DataContext
  2. A Great Expectations CheckpointConfig object

If no Checkpoint is supplied, a default one will be built.

The operator also enables you to pass in a Python dictionary containing kwargs that will be added to, or substituted into, the Checkpoint at runtime.
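
For illustration, a minimal DAG using the operator might look like the following sketch. The parameter names data_context_root_dir, checkpoint_name, and checkpoint_kwargs are assumptions; check the operator's docstring for the exact signature in your installed version:

from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

with DAG(
    dag_id="example_great_expectations",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Point the operator at a directory containing a yaml-based Data Context
    # and reference a Checkpoint already stored in that Data Context.
    validate_data = GreatExpectationsOperator(
        task_id="validate_data",
        data_context_root_dir="/usr/local/airflow/include/great_expectations",
        checkpoint_name="my_checkpoint",  # hypothetical Checkpoint name
        checkpoint_kwargs={"run_name": "airflow_validation_run"},  # runtime overrides
    )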

Modules

Great Expectations Base Operator: A base operator for Great Expectations. Import into your DAG via:

from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

Previously Available Email Alert Functionality

The email alert functionality available in version 0.0.7 has been removed in order to keep the operator's purpose narrow and focused on running Great Expectations validations. There is now a validation_failure_callback parameter in the base operator's constructor, which can be used for any kind of notification upon failure, in case the notification mechanisms provided by the Great Expectations framework itself do not suffice.
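
As a rough sketch of what such a callback could look like (reusing the operator from the example above; the argument passed to the callback is assumed here to be the result of the Checkpoint run, so verify against the operator's docstring for your version):

def notify_on_validation_failure(validation_result):
    # Hypothetical handler: forward the failed result to whatever alerting
    # channel you use (logs, Slack, PagerDuty, etc.).
    print(f"Great Expectations validation failed: {validation_result}")

validate_data = GreatExpectationsOperator(
    task_id="validate_data",
    data_context_root_dir="/usr/local/airflow/include/great_expectations",
    checkpoint_name="my_checkpoint",  # hypothetical Checkpoint name
    validation_failure_callback=notify_on_validation_failure,
)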

Examples

See the example_dags directory for an example DAG with some sample tasks that demonstrate operator functionality.

The example DAG can be exercised in one of two ways:

With the open-source Astro CLI (recommended):

  1. Initialize a project with the Astro CLI

  2. Copy the example DAG into the dags/ folder of your astro project

  3. Copy the directories in the include folder of this repository into the include directory of your Astro project

  4. Copy your GCP credentials.json file into the base directory of your Astro project

  5. Add the following to your Dockerfile to install the airflow-provider-great-expectations package, enable xcom pickling, and add the required Airflow variables and connection to run the example DAG:

    RUN pip install --user airflow_provider_great_expectations
    ENV AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True
    ENV GOOGLE_APPLICATION_CREDENTIALS=/usr/local/airflow/credentials.json
    ENV AIRFLOW_VAR_MY_PROJECT=<YOUR_GCP_PROJECT_ID>
    ENV AIRFLOW_VAR_MY_BUCKET=<YOUR_GCS_BUCKET>
    ENV AIRFLOW_VAR_MY_DATASET=<YOUR_BQ_DATASET>
    ENV AIRFLOW_VAR_MY_TABLE=<YOUR_BQ_TABLE>
    ENV AIRFLOW_CONN_MY_BIGQUERY_CONN_ID='google-cloud-platform://?extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&extra__google_cloud_platform__project=bombora-dev&extra__google_cloud_platform__key_path=%2Fusr%2Flocal%2Fairflow%2Fairflow-gcp.bombora-dev.iam.gserviceaccount.com.json'
    
  6. Run astro dev start to view the DAG on a local Airflow instance (you will need Docker running)

With a vanilla Airflow installation:

  1. Add the example DAG to your dags/ folder
  2. Make the great_expectations and data directories in include/ available in your environment.
  3. Change the data_file and ge_root_dir paths in your DAG file to point to the appropriate places.
  4. Change the paths in great-expectations/checkpoints/*.yml to point to the absolute path of your data files.
  5. Change the value of enable_xcom_pickling to true in your airflow.cfg
  6. Set the appropriate Airflow variables and connection as detailed in the above instructions for using the Astro CLI.

Development

Setting Up the Virtual Environment

Any virtual environment tool can be used, but the simplest approach is likely using the venv tool included in the Python standard library.

For example, creating a virtual environment for development against this package can be done with the following (assuming bash):

# Create the virtual environment using venv:
$ python -m venv --prompt my-af-ge-venv .venv

# Activate the virtual environment:
$ . .venv/bin/activate

# Install the package and testing dependencies:
(my-af-ge-venv) $ pip install -e '.[tests]'

Running Unit, Integration, and Functional Tests

Once the above is done, the unit and integration tests can be run with either of the following approaches.

Using pytest

The pytest library and CLI are preferred by this project (and by many Python developers) because of the rich API and the additional control they give you over things like test output and test markers. It is included as a dependency in requirements.txt.

The simple command pytest -p no:warnings, when run in the virtual environment created with the above process, provides concise output when all tests pass, filtering out deprecation warnings that may be issued by Airflow, and output that is only as detailed as necessary when they don't:

(my-af-ge-venv) $ pytest -p no:warnings
=========================================================================================== test session starts ============================================================================================
platform darwin -- Python 3.7.4, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/jpayne/repos-bombora/bombora-airflow-provider-great-expectations, configfile: pytest.ini, testpaths: tests
plugins: anyio-3.3.0
collected 7 items

tests/operators/test_great_expectations.py .......                                                                                                                                                   [100%]

============================================================================================ 7 passed in 11.99s ============================================================================================

Functional Testing

Functional testing entails simply running the example DAG using, for instance, one of the approaches outlined above, with the one adjustment that the local development package is installed in the target Airflow environment.

Again, the recommended approach is to use the Astro CLI.

This operator is in early stages of development! Feel free to submit issues, PRs, or join the #integration-airflow channel in the Great Expectations Slack for feedback. Thanks to Pete DeJoy and the Astronomer.io team for the support.

More Repositories

  1. dag-factory (Python, 1,154 stars): Dynamically generate Apache Airflow DAGs from YAML configuration files
  2. airflow-guides (JavaScript, 797 stars): Guides and docs to help you get up and running with Apache Airflow.
  3. astronomer-cosmos (Python, 589 stars): Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
  4. astronomer (Python, 444 stars): Helm Charts for the Astronomer Platform, Apache Airflow as a Service on Kubernetes
  5. astro-cli (Go, 348 stars): CLI that makes it easy to create, test and deploy Airflow DAGs to Astronomer
  6. astro-sdk (Python, 338 stars): Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
  7. airflow-chart (Python, 252 stars): A Helm chart to install Apache Airflow on Kubernetes
  8. ask-astro (Python, 187 stars): An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
  9. airflow-dbt-demo (Python, 165 stars): A repository of sample code to accompany our blog post on Airflow and dbt.
  10. astronomer-providers (Python, 134 stars): Airflow Providers containing Deferrable Operators & Sensors from Astronomer
  11. ap-airflow (Jinja, 102 stars): Astronomer Core Docker Images
  12. airflow-data-quality-demo (Python, 71 stars): A repository of sample code to show data quality checking best practices using Airflow.
  13. airflow-provider-sample (Python, 69 stars): A template repo for building and releasing Airflow provider packages.
  14. airflow-example-dags (Python, 60 stars): Sample Airflow DAGs
  15. webinar-dag-writing-best-practices (Python, 48 stars)
  16. airflow-quickstart (Python, 46 stars): Get started with Apache Airflow. Check the README for instructions on how to run your first DAGs today. 🚀
  17. airflow-provider-kafka (Python, 37 stars): A provider package for kafka
  18. docs (Python, 36 stars): This repository contains all content and code for Astro and Astronomer Software documentation.
  19. telescope (Python, 30 stars)
  20. ray-airflow-demo (Jupyter Notebook, 29 stars)
  21. dynamic-dags-tutorial (Python, 27 stars)
  22. cosmos-demo (Python, 25 stars): Demo DAGs that show how to run dbt Core in Airflow using Cosmos
  23. airflow-provider-mlflow (Python, 25 stars): An MLflow Provider Package for Apache Airflow
  24. airflow-ui (TypeScript, 24 stars)
  25. starship (Python, 22 stars)
  26. astro-provider-databricks (Python, 21 stars): Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows
  27. airflow-testing-guide (Python, 20 stars)
  28. webinar-demos (19 stars)
  29. deploy-action (18 stars): Custom Github Actions
  30. airflow-testing-skeleton (Python, 18 stars): A skeleton project for testing Airflow code
  31. airflow-provider-fivetran-async (Python, 18 stars): A new Airflow Provider for Fivetran, maintained by Astronomer and Fivetran
  32. airflow-covid-data (Python, 16 stars): Sample Airflow DAGs to load data from the CovidTracking API to Snowflake via an AWS S3 intermediary.
  33. airflow-provider-duckdb (Python, 14 stars): A provider package for DuckDB
  34. astronomer-fab-securitymanager (Python, 12 stars): Security Manager for the Astronomer Airflow distribution
  35. apache-airflow-providers-transfers (Python, 11 stars)
  36. airflow-dbt-elt (Python, 11 stars): This repo contains DAGs demonstrating a variety of ELT patterns using Airflow along with dbt.
  37. ap-vendor (Dockerfile, 11 stars): Astronomer Vendor Images
  38. intro-to-airflow-webinar (Python, 10 stars)
  39. airflow-guide-passing-data-between-tasks (Python, 10 stars)
  40. terraform-google-astronomer-gcp (HCL, 10 stars): Intended for internal use: deploys all infrastructure required for Astronomer to run on GCP
  41. cs-tutorial-msteams-callbacks (Python, 9 stars): Example DAGs demonstrating how to implement alerting and notifications via Microsoft Teams
  42. astro-provider-venv (Go, 9 stars): Easily create and use Python Virtualenvs in Apache Airflow
  43. astronomer-airflow-scripts (Python, 9 stars): Waits for Apache Airflow database migrations to complete.
  44. terraform-aws-astronomer-aws (HCL, 9 stars): Deploys all infrastructure required for Astronomer to run on AWS. For a complete deployment, see https://github.com/astronomer/terraform-aws-astronomer-enterprise
  45. airflow-scheduling-tutorial (Python, 8 stars)
  46. cosmos-example (Python, 8 stars)
  47. 2-4-example-dags (Python, 7 stars)
  48. mlflow-example (Python, 7 stars)
  49. 2-6-example-dags (Python, 7 stars)
  50. airflow-provider-pulumi (Python, 6 stars)
  51. academy-genai (Python, 6 stars)
  52. kedro-ge-airflow (Python, 6 stars)
  53. registry-dag-template (Python, 6 stars): A template repository for contributing DAGs to the Astronomer Registry.
  54. dynamic-task-mapping-tutorial (Python, 6 stars)
  55. apache-airflow-providers-alembic (Python, 6 stars)
  56. webinar-secrets-management (Python, 5 stars)
  57. custom-xcom-backend-tutorial (Jupyter Notebook, 5 stars)
  58. airflow-sql-tutorial (Python, 5 stars)
  59. terraform-kubernetes-astronomer (HCL, 5 stars): Deploy Astronomer on Kubernetes
  60. cs-tutorial-slack-callbacks (Python, 5 stars): Example DAGs demonstrating how to implement alerting and notifications via Slack
  61. airflow-dags (Python, 5 stars): Example DAGs for Airflow 2.9
  62. airflow-analytics-plugin (Python, 5 stars)
  63. terraform-provider-astro (Go, 5 stars): Astro Terraform Provider
  64. airflow-databricks-tutorial (Python, 4 stars)
  65. azure-operator-tutorials (Python, 4 stars)
  66. education-sandbox (Python, 4 stars): Codespace with Airflow and the Astro CLI
  67. terraform (HCL, 4 stars): Getting phased out - please use astronomer/terraform-* modules to track issues
  68. cross-dag-dependencies-tutorial (Python, 4 stars)
  69. apache-airflow-providers-isolation (Python, 4 stars)
  70. cdc-cloudsql-airflow-demo (Python, 4 stars): A repository of sample code to accompany our blog post on Change Data Capture and CloudSQL
  71. airflow-ldap-example (Python, 4 stars): Example project for configuring open source Airflow version with LDAP. Includes prepopulated OpenLDAP server
  72. airflow-wandb-demo (Python, 3 stars)
  73. spectra (JavaScript, 3 stars): Reusable UI components for Astronomer projects.
  74. airflow-llm-demo (Python, 3 stars)
  75. 2-7-example-dags (Python, 3 stars)
  76. airflow-snowpark-containers-demo (Python, 3 stars)
  77. astro-example-dags (Python, 3 stars)
  78. airflow-talend-tutorial (Python, 3 stars): Tutorial for how to use Astronomer+Airflow with Talend. Contains reference DAGs and other supporting materials.
  79. azure_demo (Python, 3 stars)
  80. astro-gcp-onboarding (Shell, 3 stars): The script needed to set up a customer's Google Cloud Project for an Astro activation
  81. greenplum-airflow-demo (Python, 3 stars)
  82. cs-astro-onboarding (Python, 3 stars)
  83. homebrew-tap (Ruby, 3 stars): Homebrew Formulae to @astronomer binaries, powered by @astronomer
  84. debugging-dags-webinar (Python, 3 stars): A repository containing the DAGs shown in the Debugging DAGs webinar on 2023-01-31.
  85. migrate-to-astro (Python, 2 stars): Customer facing utilities to help customers migrating from Software/Nebula to Astro
  86. airflow-connection-docs (Python, 2 stars): Guides and structured metadata about Airflow connections
  87. pass-data-between-tasks-webinar (Python, 2 stars): The repository for example DAGs shown in the 2023-04-11 Astronomer webinar on passing data between tasks.
  88. pyconuk2022 (Python, 2 stars): Materials related to the PyCon UK Apache Airflow & Astro SDK workshop
  89. ds-ml-example-dags (Python, 2 stars)
  90. react-graphql-code-challenge (JavaScript, 2 stars): Code challenge for Front-End developers applying at Astronomer
  91. airflow-sagemaker-tutorial (Python, 2 stars)
  92. databricks-ml-example (Jupyter Notebook, 2 stars)
  93. airflow_101_webinar (Python, 2 stars): Repository for the Airflow 101 webinar on June 6th 2023.
  94. pagerduty_airflow_integration_benefits (2 stars): Repo hosting PagerDuty + Airflow Integration Benefits Doc
  95. sagemaker-batch-inference (Jupyter Notebook, 2 stars)
  96. airflow-adf-integration (Python, 2 stars): An example DAG for orchestrating Azure Data Factory pipelines with Apache Airflow.
  97. astro-dbt-provider-tutorial-example (Python, 2 stars): Example code for the dbt core Learn tutorial. The Astro dbt provider, also known as Cosmos, is a tool to automatically integrate dbt models into your Airflow DAGs.
  98. cs-tutorial-reporting (Python, 2 stars): How to load a reporting database for Airflow DAGs, DAG Runs, and Task Instances
  99. cosmos-dev (Python, 2 stars)
  100. llm-dags-dashboard (HTML, 2 stars): Repository for displaying LLM DAG runs status