  • Stars: 205
  • Rank: 187,440 (Top 4%)
  • License: Apache License 2.0
  • Created: about 5 years ago
  • Updated: about 1 year ago

Repository Details

dbt-audit-helper

Useful macros when performing data audits

Contents

Installation instructions

New to dbt packages? Read more about them here.

  1. Include this package in your packages.yml file; check here for the latest version number.
  2. Run dbt deps to install the package.
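
For example, a minimal packages.yml entry might look like the following sketch; the version shown is a placeholder, so check the hub listing for the current release:

packages:
  - package: dbt-labs/audit_helper
    version: 1.0.0  # placeholder version; pin to the latest release listed on hub.getdbt.com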

Macros

compare_relations (source)

This macro generates SQL that can be used to do a row-by-row validation of two relations. It is largely based on the equality test in dbt-utils. By default, the generated query returns a summary of audit results, like so:

| in_a  | in_b  | count | percent_of_total |
|-------|-------|-------|------------------|
| True  | True  | 6870  | 99.74            |
| True  | False | 9     | 0.13             |
| False | True  | 9     | 0.13             |

Setting the summarize argument to false lets you check which rows do not match between relations:

| order_id | order_date | status    | in_a  | in_b  |
|----------|------------|-----------|-------|-------|
| 1        | 2018-01-01 | completed | True  | False |
| 1        | 2018-01-01 | returned  | False | True  |
| 2        | 2018-01-02 | completed | True  | False |
| 2        | 2018-01-02 | returned  | False | True  |

This query is particularly useful when you want to check that a refactored model, or a model that you are migrating over from a legacy system, matches up with the original.

Usage: The query is best used in dbt Develop so you can interactively check the results.

{# in dbt Develop #}

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="fct_orders"
) -%}

{% set dbt_relation=ref('fct_orders') %}

{{ audit_helper.compare_relations(
    a_relation=old_etl_relation,
    b_relation=dbt_relation,
    exclude_columns=["loaded_at"],
    primary_key="order_id"
) }}

Arguments:

  • a_relation and b_relation: The relations you want to compare.
  • exclude_columns (optional): Any columns you wish to exclude from the validation.
  • primary_key (optional): The primary key of the model (or concatenated SQL to create the primary key). Used to sort unmatched results for row-by-row validation.
  • summarize (optional): Allows you to switch between a summary or detailed view of the compared data. Accepts true or false values. Defaults to true.
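
Building on the example above, here is a minimal sketch of a detailed comparison: setting summarize=false returns the unmatched rows themselves, and primary_key can be a concatenated SQL expression (the order_line_number column below is hypothetical):

{# summarize=false returns row-level detail; order_line_number is a hypothetical column used to build a composite key #}
{{ audit_helper.compare_relations(
    a_relation=old_etl_relation,
    b_relation=dbt_relation,
    exclude_columns=["loaded_at"],
    primary_key="order_id || '-' || order_line_number",
    summarize=false
) }}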

compare_queries (source)

Super similar to compare_relations, except it takes two select statements. This macro is useful when:

  • You need to filter out records from one of the relations.
  • You need to rename or recast some columns to get them to match up.
  • You only want to compare a small number of columns, so it's easier to write out the columns you want to compare than the columns you want to exclude.
{# in dbt Develop #}

{% set old_fct_orders_query %}
  select
    id as order_id,
    amount,
    customer_id
  from old_etl_schema.fct_orders
{% endset %}

{% set new_fct_orders_query %}
  select
    order_id,
    amount,
    customer_id
  from {{ ref('fct_orders') }}
{% endset %}

{{ audit_helper.compare_queries(
    a_query=old_fct_orders_query,
    b_query=new_fct_orders_query,
    primary_key="order_id"
) }}

Arguments:

  • summarize (optional): Allows you to switch between a summary or detailed view of the compared data. Accepts true or false values. Defaults to true.

compare_column_values (source)

This macro will return a query that, when executed, compares a column across two queries and summarizes how many records match perfectly (note: a primary key is required to match values across the two queries).

| match_status                | count  | percent_of_total |
|-----------------------------|--------|------------------|
| ✅: perfect match           | 37,721 | 79.03            |
| ✅: both are null           | 5,789  | 12.13            |
| 🤷: missing from b          | 25     | 0.05             |
| 🤷: value is null in a only | 59     | 0.12             |
| 🤷: value is null in b only | 73     | 0.15             |
| 🙅: values do not match     | 4,064  | 8.51             |

This macro is useful when:

  • You've used the compare_queries macro (above) and found that a significant number of your records don't match, and you now want to find out which column is causing most of these discrepancies.

Usage:

{# in dbt Develop #}

{% set old_etl_relation_query %}
    select * from public.dim_product
    where is_latest
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{% set audit_query = audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="product_id",
    column_to_compare="status"
) %}

{% set audit_results = run_query(audit_query) %}

{% if execute %}
{% do audit_results.print_table() %}
{% endif %}

Usage notes:

  • primary_key must be a unique key in both tables, otherwise the join won't work as expected.

Advanced usage - dbt Cloud:

The .print_table() function is not compatible with dbt Cloud, so an adjustment needs to be made in order to print the results. Add the following code to a new macro file. To run the macro, execute dbt run-operation print_audit_output() in the command bar.

{% macro print_audit_output() %}
{%- set columns_to_compare=adapter.get_columns_in_relation(ref('fct_orders'))  -%}

{% set old_etl_relation_query %}
    select * from old_etl_schema.fct_orders
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('fct_orders') }}
{% endset %}

{% if execute %}
    {% for column in columns_to_compare %}
        {{ log('Comparing column "' ~ column.name ~'"', info=True) }}
        {% set audit_query = audit_helper.compare_column_values(
                a_query=old_etl_relation_query,
                b_query=new_etl_relation_query,
                primary_key="order_id",
                column_to_compare=column.name
        ) %}

        {% set audit_results = run_query(audit_query) %}

        {% do log(audit_results.column_names, info=True) %}
            {% for row in audit_results.rows %}
                  {% do log(row.values(), info=True) %}
            {% endfor %}
    {% endfor %}
{% endif %}

{% endmacro %}
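
To run it, execute the operation from your terminal or the dbt Cloud command bar:

dbt run-operation print_audit_output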

compare_relation_columns (source)

This macro will return a query that, when executed, compares the ordinal_position and data_type of columns in two relations.

| column_name | a_ordinal_position | b_ordinal_position | a_data_type       | b_data_type       |
|-------------|--------------------|--------------------|-------------------|-------------------|
| order_id    | 1                  | 1                  | integer           | integer           |
| customer_id | 2                  | 2                  | integer           | integer           |
| order_date  | 3                  | 3                  | timestamp         | date              |
| status      | 4                  | 5                  | character varying | character varying |
| amount      | 5                  | 4                  | bigint            | bigint            |

This is especially useful in two situations:

  1. Comparing a new version of a relation with an old one, to make sure that the structure is the same
  2. Helping figure out why a union of two relations won't work (often because the data types are different)

For example, in the above result set, we can see that status and amount have switched order. Further, order_date is a timestamp in our "a" relation, whereas it is a date in our "b" relation.

Note: For adapters other than BigQuery, Postgres, Redshift, and Snowflake, the ordinal_position is inferred based on the response from dbt Core's adapter.get_columns_in_relation(), as opposed to being loaded from the information schema.

{#- in dbt Develop -#}

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="fct_orders"
) -%}

{% set dbt_relation=ref('fct_orders') %}

{{ audit_helper.compare_relation_columns(
    a_relation=old_etl_relation,
    b_relation=dbt_relation
) }}

compare_all_columns (source)

This macro is designed to be added to a dbt test suite as a custom test. A compare_all_columns test monitors changes to data values when code is changed as part of a PR or during development. It sets up a test that will fail if any column values do not match.

You can configure what exactly constitutes a value match or failure. If there is a test failure, results can be inspected in the warehouse. The primary key and the column name can be included in the test output that gets written to the warehouse, which lets you join test results to relevant tables in your dev or prod schema to investigate the error.

Usage:

Note: this test should only be used on (and will only work on) models that have a primary key that is reliably unique and not_null. Generic dbt tests should be used to ensure the model being tested meets the requirements of unique and not_null.
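
For example, a minimal schema.yml sketch for a model such as stg_customers (used in the examples below), with id assumed to be its primary key:

models:
  - name: stg_customers
    columns:
      - name: id
        tests:
          - unique
          - not_null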

To create a test for the stg_customers model, create a custom test in the tests subdirectory of your dbt project that looks like this:

-- In a test, the ref() below compiles to your dev or PR schema. The api.Relation.create()
-- call explicitly selects your production schema, but any other database/schema/table
-- can be used for comparison testing.
{{
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    exclude_columns=['updated_at'],
    primary_key='id'
  )
}}
where not perfect_match

The where not perfect_match statement is an example of a filter you can apply to define what constitutes a test failure. The test will fail if any rows don't meet the requirement of a perfect match. Failures would include:

  • If the primary key exists in both relations, but one model has a null value in a column.
  • If a primary key is missing from one relation.
  • If the primary key exists in both relations, but the value conflicts.

If you'd like the test to only fail when there are conflicting values, you could configure it like this:

{{ 
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'), 
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    primary_key='id'
  ) 
}}
where conflicting_values

Arguments:

  • a_relation and b_relation: The relations you want to compare. Any two relations that have the same columns can be used. In the example above, two different approaches to writing relations, using ref and using api.Relation.create, are demonstrated. (When writing one-off code, it might make sense to hard-code a relation, like this: analytics_prod.stg_customers. A hard-coded relation is not recommended when building this macro into a CI cycle.)
  • exclude_columns (optional): Any columns you wish to exclude from the validation.
  • primary_key: The primary key of the model (or concatenated SQL to create the primary key). Used to sort unmatched results for row-by-row validation.

If you want to create test results that include columns from the model itself for easier inspection, that can be written into the test:

{{ 
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), 
    exclude_columns=['updated_at'], 
    primary_key='id'
  ) 
}}
left join {{ ref('stg_customers') }} using(id)

This structure also allows for the test to group or filter by any attribute in the model or in the macro's output as part of the test, for example:

with base_test_cte as (
  {{ 
    audit_helper.compare_all_columns(
      a_relation=ref('stg_customers'),
      b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), 
      exclude_columns=['updated_at'], 
      primary_key='id'
    ) 
  }}
  left join {{ ref('stg_customers') }} using(id)
  where conflicting_values
)
select
  status, -- assume there's a "status" column in stg_customers
  count(distinct case when conflicting_values then id end) as conflicting_values
from base_test_cte
group by 1

You can write a compare_all_columns test for an individual model; it will run as part of a full test suite, or you can run it on its own by selecting the model:

dbt test --select stg_customers

If you want to store results in the warehouse for further analysis, add the --store-failures flag.

dbt test --select stg_customers --store-failures
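
With --store-failures, dbt writes the failing rows to a table named after the test in an audit schema (by default, your target schema suffixed with _dbt_test__audit), so you can query them directly. A minimal sketch, assuming the test file is named compare_stg_customers_to_prod.sql and the target schema is analytics_dev:

select *
from analytics_dev_dbt_test__audit.compare_stg_customers_to_prod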

More Repositories

1

dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
Python
7,503
star
2

metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
Python
1,086
star
3

dbt-utils

Utility functions for dbt projects.
Python
929
star
4

corp

Assets related to the operation of Fishtown Analytics.
411
star
5

dbt-project-evaluator

This package contains macros and models to find DAG issues automatically
Shell
385
star
6

dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
Python
365
star
7

dbt-codegen

Macros that generate dbt code
Makefile
322
star
8

dbt-external-tables

dbt macros to stage external sources
PLpgSQL
285
star
9

jaffle_shop

A self-contained dbt project for testing purposes
264
star
10

dbt_metrics

Macros for calculating metrics
Python
203
star
11

dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
Python
199
star
12

dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
Python
163
star
13

dbt-project-maturity

132
star
14

dbt-labs-experimental-features

dbt support for database features which are not yet supported natively in dbt-core
Makefile
131
star
15

snowplow

Data models for snowplow analytics.
LookML
126
star
16

docs.getdbt.com

The code behind docs.getdbt.com
JavaScript
111
star
17

dbt-docs

Auto-generated data documentation site for dbt projects
JavaScript
111
star
18

dbt-jsonschema

Python
104
star
19

jaffle_shop_duckdb

Get started with dbt in less than 1 minute from `git clone` to `dbt docs serve` for free!
Shell
99
star
20

dbt-event-logging

a dbt package to make auditing dbt runs easy.
LookML
91
star
21

dbt-meshify

A dbt-core plugin that automates the management and creation of dbt groups, contracts, access, and versions.
Python
91
star
22

dbt-server

A web API for dbt.
Python
89
star
23

redshift

Redshift package for dbt (getdbt.com)
PLpgSQL
87
star
24

dbt-completion.bash

Adds autocompletion to the dbt CLI
Shell
82
star
25

dbt-redshift

dbt-redshift contains all of the code enabling dbt to work with Amazon Redshift
Python
80
star
26

terraform-provider-dbtcloud

dbt Cloud Terraform Provider
Go
73
star
27

mrr-playbook

Makefile
70
star
28

dbt-learn-group-training

The go to demo for public and private dbt Learn
69
star
29

segment

Data models for Segment built using dbt (getdbt.com).
Makefile
65
star
30

dbt-sql-formatter

makes your sql less bad
Python
57
star
31

dbt-semantic-interfaces

The shared semantic layer definitions that dbt-core and MetricFlow use.
Python
55
star
32

dbt-init

A dbt-init script for consulting projects
Python
51
star
33

jaffle-shop-template

Template for a DuckDB-based, Codespace-oriented sandbox project that is also dbt Cloud compatible, and includes code-first BI tooling via Evidence.
Python
49
star
34

dbt-starter-project

Cloned by the `dbt init` task
44
star
35

jaffle-shop

🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
42
star
36

dbt-learn-jinja

41
star
37

metrics-playbook

This repository contains files for the metrics framework playbook.
36
star
38

facebook-ads

dbt data models for facebook ads
35
star
39

dbt-presto

[ARCHIVED] The Presto adapter plugin for dbt Core
Python
33
star
40

spark-utils

Utility functions for dbt projects running on Spark
Python
30
star
41

stripe

TSQL
29
star
42

attribution-playbook

TSQL
27
star
43

dbt-databricks-demo

Demo project for dbt on Databricks
27
star
44

dbt-technical-blog-writing

Conversation around dbt technical tutorials, blogs, guides, etc
26
star
45

python-snowpark-formula1

Python
25
star
46

dbtdocs-to-lookml

[WIP] A script to add descriptions from dbt schema files to your lookml project
Python
25
star
47

jaffle_shop_metrics

This repository contains examples of how to use dbt's metric functionality on the jaffle shop dataset
24
star
48

tree-sitter-jinja2

C
23
star
49

quickbooks

dbt data models for Quickbooks Online.
TSQL
23
star
50

hub.getdbt.com

Package hub for dbt.
SCSS
23
star
51

dbt-jobs-as-code

Tools to handle dbt Jobs as well-defined YAML files
Python
22
star
52

semantic-layer-llm-benchmarking

Shell
22
star
53

dbt-extractor

Rust
21
star
54

airflow-fivetran-dbt

Example orchestration pipeline for Fivetran + dbt managed by Airflow
Python
20
star
55

rapid-onboarding-exemplar

dbt Project for Rapid Onboarding instructors to use in instruction and learners to reference throughout the course.
Python
20
star
56

jaffle-sl-template

19
star
57

2023-04-18---zero-to-dbt

Welcome! This dbt project is built to be imported to a freshly-initialized dbt project to work through the hands-on zero to dbt lab detailed in this repo's readme.
17
star
58

dbt-starburst-demo

dbt + Trino demo project, using TPC-H sample data
17
star
59

atom-dbt

Atom highlighter for dbt projects
16
star
60

adwords

dbt adwords models
16
star
61

dbot

An LLM-powered chatbot with the added context of the dbt knowledge base.
Python
16
star
62

dbt-adapter-tests

a pytest plugin for dbt adapter test suites
Python
16
star
63

dbt_faker

16
star
64

dbt-rpc

A server that can compile and run queries in the context of a dbt project. Additionally, it provides methods that can be used to list and terminate running processes.
Python
16
star
65

postgres

Postgres utility package for dbt (getdbt.com)
14
star
66

dbt-core-bundles

Generates bundles of verified adapters + core
Python
14
star
67

dbt-database-adapter-scaffold

Python
14
star
68

tap-framework

a framework for rapidly prototyping new singer taps
Python
14
star
69

actions

Common GitHub actions and workflows for maintaining dbt
Python
12
star
70

databricks_dbt_demo_project

dbt Cloud project for Databricks SQL Analytics Demos
Python
12
star
71

salesforce

11
star
72

homebrew-dbt

๐Ÿป Homebrew formulae for installing dbt on macOS
Ruby
11
star
73

stitch-utils

Utility functions for Stitch-loaded data
Makefile
10
star
74

dbt-cloud-snowflake-demo-template

9
star
75

dbt-styleguide

Styles for dbt on the net
HTML
9
star
76

terraform-aws-dbt-cloud-single-tenant

HCL
9
star
77

tap-s3-csv

Singer tap for getting CSV and XLS(X) data out of Amazon S3
Python
9
star
78

analytics-engineering-survey

A survey of pains, gains, and areas of investment for global data teams.
9
star
79

dbt-package-workshop

The companion repo to the 2022 Coalesce New Orleans Workshop - dbt Packages You Didn't Know You Needed
9
star
80

jaffle-shop-generator

Python
8
star
81

hubcap

This app adds modules to the hubsite at hub.getdbt.com
Python
8
star
82

dbt-python-hands-on-lab-snowpark

Python
7
star
83

shopify-data-warehouse

LookML
7
star
84

ecommerce

LookML
7
star
85

python-string-parsing

Demo using dateutil library with pandas dataframes in Python
Python
7
star
86

hologram

A library for automatically generating Draft 7 JSON Schemas from Python dataclasses
Python
7
star
87

shopify

dbt data models for Shopify.
TSQL
7
star
88

dbt-cloud-openapi-spec

Python
6
star
89

new-python-wrench-demo

Demo of Python models to accompany the Snowflake blog post
Jupyter Notebook
6
star
90

dbt-integration-tests

Python
6
star
91

jaffle_shop-dev

The dev version of jaffle shop
6
star
92

snowflake-resource-monitoring

6
star
93

log-tables-example

Example of log table creation using dbt.
Shell
5
star
94

tap-amazon-mws

Singer.io tap for Amazon MWS
Python
5
star
95

tap-ringcentral

Singer.io tap for RingCentral
Python
5
star
96

snowflake_dbt_partner_demo

PLpgSQL
5
star
97

Coalesce2022-Training-Fundamentals

This is the repository for the Coalesce 2022 Fundamentals Live training
5
star
98

fishtown-ui

The design implementation used at Fishtown Analytics.
TypeScript
5
star
99

coalesce-2022-dag-workshop

Get more from your DAG! A Coalesce 2022 workshop that outlines how to use macros and DAG introspection to get more from your dbt project.
5
star
100

tap-amazon-advertising

Singer.io tap for Amazon Advertising
Python
5
star