

Monitor the stability of a Pandas or Spark dataframe ⚙︎

Population Shift Monitoring



popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and Spark datasets.

popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. Using monitoring business rules, popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc.
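
To make the idea concrete, here is a minimal standard-library sketch of the approach described above: one histogram per time slice, a simple distance to a reference histogram, and a traffic-light rule on that distance. It does not use popmon's API. The function names, the toy data, the choice of distance (largest difference in bin probability) and the thresholds are all assumptions made for illustration; popmon itself applies a wider set of statistical comparisons.

```python
from collections import Counter, defaultdict


def slice_histograms(records, time_key, feature):
    """Build one histogram (value -> count) per time slice."""
    hists = defaultdict(Counter)
    for row in records:
        hists[row[time_key]][row[feature]] += 1
    return dict(hists)


def max_prob_diff(hist, reference):
    """Largest absolute difference in bin probability between two histograms."""
    n, n_ref = sum(hist.values()), sum(reference.values())
    bins = set(hist) | set(reference)
    return max(abs(hist[b] / n - reference[b] / n_ref) for b in bins)


def traffic_light(value, yellow=0.1, red=0.4):
    """Map a distance to a colour; the thresholds are hypothetical."""
    if value >= red:
        return "red"
    return "yellow" if value >= yellow else "green"


# Made-up toy data: one record per person, tagged with a week number.
weeks = [(1, "MMFF"), (2, "MFMF"), (3, "MMMF")]
records = [{"week": w, "gender": g} for w, genders in weeks for g in genders]

hists = slice_histograms(records, "week", "gender")
reference = hists[1]  # use the first time slice as the reference
lights = {
    week: traffic_light(max_prob_diff(hist, reference))
    for week, hist in sorted(hists.items())
}
print(lights)
```

In this toy data the gender mix in week 3 differs from the reference by 0.25 in bin probability, which falls between the two assumed thresholds, so that week is flagged yellow while the others stay green.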

Figures: Traffic Light Overview; Histogram inspector

Announcements

Spark 3.0

With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar files:

spark = SparkSession.builder.config(
    "spark.jars.packages",
    "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
).getOrCreate()

For Spark 2.X compiled against Scala 2.11, simply replace 2.12 with 2.11 in the string above.
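
If you switch between Spark versions regularly, the package string can be built from the Scala version instead of edited by hand. The helper below is a hypothetical convenience function, not part of popmon or histogrammar; it only reproduces the string shown above, with the 1.0.20 release taken from that example.

```python
def histogrammar_packages(scala_version: str, hg_version: str = "1.0.20") -> str:
    """Return the value for spark.jars.packages for a given Scala version.

    Hypothetical helper: artifact names and the default release are copied
    from the configuration string shown above.
    """
    artifacts = ("histogrammar", "histogrammar-sparksql")
    return ",".join(
        f"io.github.histogrammar:{name}_{scala_version}:{hg_version}"
        for name in artifacts
    )


print(histogrammar_packages("2.12"))  # Spark 3.0
print(histogrammar_packages("2.11"))  # Spark 2.X
```

The result can be passed as the second argument to `SparkSession.builder.config("spark.jars.packages", ...)`.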

Examples

Documentation

The entire popmon documentation including tutorials can be found at read-the-docs.

Notebooks

The following tutorials are available as Colab notebooks:

  • Basic tutorial
  • Detailed example (featuring configuration, Apache Spark and more)
  • Incremental datasets (online analysis)
  • Report interpretation (step-by-step guide)

Check it out

The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:

$ pip install popmon

or check out the code from our GitHub repository:

$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon

In this example the code is installed in editable mode (option -e).

You can now use the package in Python with:

import popmon

Congratulations, you are now ready to use the popmon library!

Quick run

As a quick example, you can do:

import pandas as pd
import popmon
from popmon import resources

# open synthetic data
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()

# generate stability report using automatic binning of all encountered features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])

# to show the output of the report in a Jupyter notebook you can simply run:
report

# or save the report to file
report.to_file("monitoring_report.html")

To specify your own binning and the features you want to report on, do:

# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(
    time_axis="date", time_width="1w", time_offset="2020-1-6"
)

# histogram selections. Here 'date' is the first axis of each histogram.
features = [
    "date:isActive",
    "date:age",
    "date:eyeColor",
    "date:gender",
    "date:latitude",
    "date:longitude",
    "date:isActive:age",
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs = {
    "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "age": {"bin_width": 10.0, "bin_offset": 0.0},
    "date": {
        "bin_width": pd.Timedelta("4w").value,
        "bin_offset": pd.Timestamp("2015-1-1").value,
    },
}

# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)
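
The `bin_width` and `bin_offset` values above can be read as follows. This is a standard-library sketch, assuming the usual convention for open-ended bins in which a value falls into bin `floor((value - bin_offset) / bin_width)`. The helper names are hypothetical and this is not popmon's internal code. The nanosecond values are computed with `datetime` and should correspond to what `pd.Timedelta("4w").value` and `pd.Timestamp("2015-1-1").value` return.

```python
import math
from datetime import datetime, timedelta, timezone


def bin_index(value, bin_width, bin_offset=0.0):
    """Index of the open-ended bin containing value (assumed convention)."""
    return math.floor((value - bin_offset) / bin_width)


def to_nanoseconds(delta: timedelta) -> int:
    """Convert a timedelta to integer nanoseconds without float rounding."""
    return (delta // timedelta(microseconds=1)) * 1000


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Time-axis binning expressed in nanoseconds, as the bin_specs above require.
four_weeks_ns = to_nanoseconds(timedelta(weeks=4))
offset_ns = to_nanoseconds(datetime(2015, 1, 1, tzinfo=timezone.utc) - EPOCH)

print(bin_index(37, bin_width=10.0))    # age 37 with 10-year bins
print(bin_index(-2.5, bin_width=5.0))   # negative values get negative indices
print(four_weeks_ns, offset_ns)
```

Because the bins are open-ended, no minimum or maximum has to be declared up front: any value maps to some integer index, including negative ones.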

These examples also work with Spark dataframes. You can see the output of such example notebook code here. For all available examples, please see the tutorials at read-the-docs.

Pipelines for monitoring dataset shift

Advanced users can leverage popmon's modular data pipeline to customize their workflow. Visualizing the pipeline can be useful for debugging or for didactic purposes, and the package includes a script for this. The plotting is configurable; depending on the options, the result shows the data flow, the high-level components and the (re)use of datasets.

Figure: Example pipeline visualization

Reports and integrations

The data shift computations that popmon performs are by default displayed in a self-contained HTML report. This format is favourable in many real-world environments, where access may be restricted. Moreover, reports can be easily shared with others.

Access to the datastore means that it's possible to integrate popmon in almost any workflow. To give an example, one could store the histogram data in a PostgreSQL database, load it from Grafana, and benefit from Grafana's visualisation and alert handling features (e.g. send an email or Slack message upon alert). This may be interesting to teams that are already invested in a particular choice of dashboarding tool.
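
As a rough sketch of what such an integration could look like, the snippet below writes per-time-bin results to a SQL table and runs the kind of query a dashboard alert might use. Everything here is an assumption for illustration: SQLite (from the standard library) stands in for PostgreSQL, the table layout and column names are hypothetical, and the rows are made-up values rather than popmon output. How the values are extracted from popmon's datastore is not shown.

```python
import sqlite3

# Made-up rows: (time_bin, feature, metric, value, traffic_light).
rows = [
    ("2020-01-06", "age", "mean_pull", 0.3, "green"),
    ("2020-01-13", "age", "mean_pull", 4.6, "yellow"),
    ("2020-01-20", "age", "mean_pull", 8.1, "red"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    "CREATE TABLE popmon_metrics ("
    "time_bin TEXT, feature TEXT, metric TEXT, value REAL, traffic_light TEXT)"
)
conn.executemany("INSERT INTO popmon_metrics VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# The kind of query a dashboard alert rule could run on a schedule.
alerts = conn.execute(
    "SELECT time_bin, feature, metric FROM popmon_metrics "
    "WHERE traffic_light = 'red' ORDER BY time_bin"
).fetchall()
print(alerts)
```

Keeping one row per time bin, feature and metric makes the table easy to plot as a time series and to filter on alert colour.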

Possible integrations are:

  • Grafana
  • Kibana

Resources on how to integrate popmon are available in the examples directory. Contributions of additional or improved integrations are welcome!

Comparison and profile extensions

External libraries or custom functionality can be easily added to Profiles and Comparisons. If you developed an extension that could be generically used, then please consider contributing it to the package.

Popmon currently integrates:

A Python/C++ implementation of Hartigan & Hartigan's dip test for unimodality. The dip test tests for multimodality in a sample by taking the maximum difference, over all sample points, between the empirical distribution function, and the unimodal distribution function that minimizes that maximum difference. Other than unimodality, it makes no further assumptions about the form of the null distribution.
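
Written as a formula, the description above corresponds to the following statistic, where $F_n$ is the empirical distribution function of the sample and $\mathcal{U}$ denotes the class of unimodal distribution functions (the symbols are chosen here for illustration):

```latex
D(F_n) = \min_{G \in \mathcal{U}} \; \sup_{x} \, \bigl| F_n(x) - G(x) \bigr|
```

A small value of $D(F_n)$ means the sample is well approximated by some unimodal distribution; a large value is evidence of multimodality.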

To enable this extension, install diptest using pip install diptest or pip install popmon[diptest].

Resources

Presentations

Title | Host | Date | Speaker
popmon: Analysis Package for Dataset Shift Detection | SciPy Conference 2022 | July 13, 2022 | Simon Brugman
Popmon - population monitoring made easy | Big Data Technology Warsaw Summit 2021 | February 25, 2021 | Simon Brugman
Popmon - population monitoring made easy | Data Lunch @ Eneco | October 29, 2020 | Max Baak, Simon Brugman
Popmon - population monitoring made easy | Data Science Summit 2020 | October 16, 2020 | Max Baak
Population Shift Monitoring Made Easy: the popmon package | Online Data Science Meetup @ ING WBAA | July 8, 2020 | Tomas Sostak
Popmon: Population Shift Monitoring Made Easy | PyData Fest Amsterdam 2020 | June 16, 2020 | Tomas Sostak
Popmon: Population Shift Monitoring Made Easy | Amundsen Community Meetup | June 4, 2020 | Max Baak

Articles

Title | Date | Author
POPMON v1.0.0: The Dataset-Shift Pokémon | Aug 3, 2022 | Pradyot Patil
Monitoring Model Drift with Python | April 16, 2022 | Jeanine Schoonemann
The Statistics Underlying the Popmon Hood | April 15, 2022 | Jurriaan Nagelkerke and Jeanine Schoonemann
popmon: code breakfast session | November 9, 2022 | Simon Brugman
Population Shift Analysis: Monitoring Data Quality with Popmon | May 21, 2021 | Vito Gentile
Popmon Open Source Package — Population Shift Monitoring Made Easy | May 20, 2020 | Nicole Mpozika

Software

  • Kedro-popmon is a plugin to integrate popmon reporting with kedro. This plugin allows you to automate the process of popmon feature and output stability monitoring. Package created by Marian Dabrowski and Stephane Collot.

Project contributors

This package was authored by ING Analytics Wholesale Banking (INGA WB). Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.

Citing popmon

If popmon has been relevant in your work, and you would like to acknowledge the project in your publication, we suggest citing the following paper:

  • Brugman, S., Sostak, T., Patil, P., Baak, M. popmon: Analysis Package for Dataset Shift Detection. Proceedings of the 21st Python in Science Conference. 161-168 (2022).

In BibTeX format:

@InProceedings{ popmon-proc-scipy-2022,
  author    = { {S}imon {B}rugman and {T}omas {S}ostak and {P}radyot {P}atil and {M}ax {B}aak },
  title     = { popmon: {A}nalysis {P}ackage for {D}ataset {S}hift {D}etection },
  booktitle = { {P}roceedings of the 21st {P}ython in {S}cience {C}onference },
  pages     = { 161 - 168 },
  year      = { 2022 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
}

Contact and support

Please note that INGA WB provides support only on a best-effort basis.

License

Copyright INGA WB. popmon is completely free, open-source and licensed under the MIT license.
