Frouros: an open-source Python library for drift detection in machine learning systems.


Frouros is a Python library for drift detection in machine learning systems that provides a combination of classical and more recent algorithms for both concept and data drift detection.

"Everything changes and nothing stands still"

"You could not step twice into the same river"

Heraclitus of Ephesus (535-475 BCE.)


⚡️ Quickstart

Concept drift

As a quick example, we can take the breast cancer dataset, induce concept drift in it, and show how to use a concept drift detector such as DDM (Drift Detection Method). We can also see how concept drift degrades the model's accuracy.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute the prequential error (accuracy is 1 - error)
metric = PrequentialError(alpha=1.0)  # alpha=1.0 (no fading) is equivalent to the plain mean error

def stream_test(X_test, y_test, y, metric, detector):
    """Simulate a data stream over X_test and y_test.

    The y argument is not used inside the function; labels are read from y_test.
    """
    drift_flag = False
    # Use distinct loop names so the function arguments are not shadowed
    for i, (X_i, y_i) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X_i.reshape(1, -1))
        error = 1 - (y_pred.item() == y_i.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")

# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), thereby changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480

More concept drift examples can be found here.
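
To make the role of `warning_level`, `drift_level` and `min_num_instances` more concrete, the following is a minimal, standard-library sketch of the decision rule described by Gama et al. (2004). It is not Frouros's implementation: the `SimpleDDM` class, its `update` return values and the synthetic error stream are assumptions made for illustration only.

```python
import math


class SimpleDDM:
    """Illustrative DDM-style rule over a stream of 0/1 errors (hypothetical class)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.reset()

    def reset(self):
        self.n = 0              # number of instances seen
        self.errors = 0         # number of misclassified instances
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # standard deviation at that point

    def update(self, error):
        """Consume one error value (1 = wrong, 0 = right) and return a status string."""
        self.n += 1
        self.errors += error
        if self.n < self.min_num_instances:
            return "ok"
        p = self.errors / self.n             # running error rate
        s = math.sqrt(p * (1 - p) / self.n)  # binomial standard deviation
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s    # remember the best operating point
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "ok"


# Synthetic stream: a 10% error rate for 200 steps, then every prediction is wrong
stream = [1 if i % 10 == 0 else 0 for i in range(200)] + [1] * 50
simple_detector = SimpleDDM()
statuses = [simple_detector.update(e) for e in stream]
first_drift = statuses.index("drift") if "drift" in statuses else None
print(f"First drift flagged at step: {first_drift}")
```

The sketch flags drift once the running error rate plus its standard deviation exceeds the best value seen so far by `drift_level` standard deviations, which is why a stable error rate does not trigger it while a sudden run of errors does.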

Data drift

As a quick example, we can take the iris dataset, induce data drift in one of its features, and show how to use a data drift detector such as the Kolmogorov-Smirnov test.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# adding Gaussian noise, thereby changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001
# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).

More data drift examples can be found here.
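
For intuition about what the test measures, here is a rough standard-library sketch of the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) together with the usual asymptotic p-value approximation. The helper names `ks_statistic` and `ks_asymptotic_p_value` and the synthetic samples are assumptions for illustration; this is not how Frouros computes its result, and the p-value is only an approximation.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


def ks_asymptotic_p_value(statistic, n, m, terms=100):
    """Asymptotic (large-sample) approximation of the two-sided p-value."""
    lam = statistic * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the Kolmogorov survival function is numerically 1 here
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


rng = random.Random(31)
reference = [rng.gauss(0.0, 1.0) for _ in range(105)]  # plays the role of X_train
shifted = [rng.gauss(3.0, 1.0) for _ in range(45)]     # plays the role of a drifted X_test

d_stat = ks_statistic(reference, shifted)
p_value = ks_asymptotic_p_value(d_stat, len(reference), len(shifted))
print(f"statistic={d_stat:.3f}, drift detected: {p_value <= 0.001}")
```

As in the example above, drift is reported when the p-value falls at or below the chosen significance level.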

🛠 Installation

Frouros can be installed via pip:

pip install frouros

🕵🏻‍♂️️ Drift detection methods

The currently implemented detectors are listed in the following table.

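| Drift detector | Type | Family | Univariate (U) / Multivariate (M) | Numerical (N) / Categorical (C) | Method | Reference |
|---|---|---|---|---|---|---|
| Concept drift | Streaming | Change detection | U | N | BOCD | Adams and MacKay (2007) |
| Concept drift | Streaming | Change detection | U | N | CUSUM | Page (1954) |
| Concept drift | Streaming | Change detection | U | N | Geometric moving average | Roberts (1959) |
| Concept drift | Streaming | Change detection | U | N | Page Hinkley | Page (1954) |
| Concept drift | Streaming | Statistical process control | U | N | DDM | Gama et al. (2004) |
| Concept drift | Streaming | Statistical process control | U | N | ECDD-WT | Ross et al. (2012) |
| Concept drift | Streaming | Statistical process control | U | N | EDDM | Baena-García et al. (2006) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-A | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-W | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | RDDM | Barros et al. (2017) |
| Concept drift | Streaming | Window based | U | N | ADWIN | Bifet and Gavalda (2007) |
| Concept drift | Streaming | Window based | U | N | KSWIN | Raab et al. (2020) |
| Concept drift | Streaming | Window based | U | N | STEPD | Nishida and Yamauchi (2007) |
| Data drift | Batch | Distance based | U | N | Anderson-Darling test | Scholz and Stephens (1987) |
| Data drift | Batch | Distance based | U | N | Bhattacharyya distance | Bhattacharyya (1946) |
| Data drift | Batch | Distance based | U | N | Earth Mover's distance | Rubner et al. (2000) |
| Data drift | Batch | Distance based | U | N | Hellinger distance | Hellinger (1909) |
| Data drift | Batch | Distance based | U | N | Histogram intersection normalized complement | Swain and Ballard (1991) |
| Data drift | Batch | Distance based | U | N | Jensen-Shannon distance | Lin (1991) |
| Data drift | Batch | Distance based | U | N | Kullback-Leibler divergence | Kullback and Leibler (1951) |
| Data drift | Batch | Distance based | M | N | MMD | Gretton et al. (2012) |
| Data drift | Batch | Distance based | U | N | PSI | Wu and Olson (2010) |
| Data drift | Batch | Statistical test | U | C | Chi-square test | Pearson (1900) |
| Data drift | Batch | Statistical test | U | N | Cramér-von Mises test | Cramér (1902) |
| Data drift | Batch | Statistical test | U | N | Kolmogorov-Smirnov test | Massey Jr (1951) |
| Data drift | Batch | Statistical test | U | N | Mann-Whitney U test | Mann and Whitney (1947) |
| Data drift | Batch | Statistical test | U | N | Welch's t-test | Welch (1947) |
| Data drift | Streaming | Distance based | M | N | MMD | Gretton et al. (2012) |
| Data drift | Streaming | Statistical test | U | N | Incremental Kolmogorov-Smirnov test | dos Reis et al. (2016) |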
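
Several of the distance-based methods listed above compare binned versions of a reference sample and a new sample. As one illustration, the sketch below computes a Population Stability Index (PSI) with the standard library only. The function name `psi`, the equal-width binning over the reference range, and the `eps` floor for empty bins are all assumptions made for this example and are not taken from Frouros.

```python
import math
import random


def psi(reference, current, num_bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 for a constant reference

    def bin_fractions(sample):
        counts = [0] * num_bins
        for v in sample:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped to the edge bins
            counts[min(max(idx, 0), num_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm is always defined
        return [max(c / len(sample), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


rng = random.Random(31)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(1000)]

psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
print(f"PSI (same sample): {psi_same:.4f}")
print(f"PSI (shifted sample): {psi_shifted:.4f}")
```

Every term of the sum is non-negative, so the index is zero only when the binned fractions match and grows as the two distributions move apart.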

❗ What is and what is not Frouros?

Unlike other libraries that, in addition to providing drift detection algorithms, include other functionality such as anomaly/outlier detection, adversarial detection, or imbalanced learning, Frouros has and will ONLY have one purpose: drift detection.

We firmly believe that machine learning libraries and frameworks should not follow the "Jack of all trades, master of none" principle. Instead, they should focus on a single task and do it well.

✅ Who is using Frouros?

Frouros is actively being used by the following projects to implement drift detection in machine learning pipelines:

If you want your project listed here, do not hesitate to send us a pull request.

👍 Contributing

Check out the contribution section.

💬 Citation

Although the Frouros paper is still a preprint, if you want to cite it you can use the preprint version (to be replaced by the paper once it is published).

@article{cespedes2022frouros,
  title={Frouros: A Python library for drift detection in machine learning systems},
  author={C{\'e}spedes-Sisniega, Jaime and L{\'o}pez-Garc{\'\i}a, {\'A}lvaro },
  journal={arXiv preprint arXiv:2208.06868},
  year={2022}
}

📝 License

Frouros is open-source software licensed under the BSD-3-Clause license.

🙏 Acknowledgements

Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.
