A scikit-learn-compatible module for comparing imputation methods.

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png

Qolmat - The Tool for Data Imputation

Qolmat provides a convenient way to select the best-performing data imputation technique for a dataset by leveraging scikit-learn-compatible algorithms. Users can compare several methods against each other using different evaluation metrics.

🔗 Requirements

Python 3.8+

🛠 Installation

Qolmat can be installed in different ways:

$ pip install qolmat  # installation via `pip`
$ pip install qolmat[pytorch]  # if you need ImputerDiffusion, which relies on PyTorch
$ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the GitHub repository

⚡️ Quickstart

Let us start with a basic imputation problem. We load the Beijing dataset through Qolmat's data utilities, keep three columns, and artificially add missing values. With just a few lines of code, you can:

  • impute missing values with one particular imputer;
  • benchmark multiple imputation methods with different metrics.

import numpy as np
import pandas as pd

from qolmat.benchmark import comparator, missing_patterns
from qolmat.imputations import imputers
from qolmat.utils import data

# load and prepare csv data

df_data = data.get_data("Beijing")
columns = ["TEMP", "PRES", "WSPM"]
df_data = df_data[columns]
df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)

# impute and compare
imputer_mean = imputers.ImputerMean(groups=("station",))
imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
imputer_var1 = imputers.ImputerEM(
    model="VAR",
    groups=("station",),
    method="mle",
    max_iter_em=50,
    n_iter_ou=15,
    dt=1e-3,
    p=1,
)
dict_imputers = {
    "mean": imputer_mean,
    "interpolation": imputer_interpol,
    "VAR(1) process": imputer_var1,
}
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
comparison = comparator.Comparator(
    dict_imputers,
    columns,
    generator_holes=generator_holes,
    metrics=["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
)
results = comparison.compare(df_with_nan)
results.style.highlight_min(color="lightsteelblue", axis=1)

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
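To make the pointwise metrics in the comparison more concrete, here is a minimal standard-library sketch of how "mae" and "wmape" are commonly defined when restricted to the artificially masked positions. The function names and the exact WMAPE formula (sum of absolute errors divided by sum of absolute true values) are assumptions for illustration; they are not taken from Qolmat's source code.

```python
def mae(truth, imputed, masked_idx):
    """Mean absolute error over the masked positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in masked_idx) / len(masked_idx)


def wmape(truth, imputed, masked_idx):
    """Weighted MAPE: total absolute error divided by total absolute true value."""
    total_error = sum(abs(truth[i] - imputed[i]) for i in masked_idx)
    total_scale = sum(abs(truth[i]) for i in masked_idx)
    return total_error / total_scale


truth = [10.0, 20.0, 30.0, 40.0]
imputed = [12.0, 20.0, 27.0, 40.0]
masked_idx = [0, 2]  # positions that were hidden before imputation

print(mae(truth, imputed, masked_idx))    # 2.5
print(wmape(truth, imputed, masked_idx))  # 0.125
```

Scoring only the masked positions matters: values that were never hidden are returned unchanged by any imputer and would otherwise dilute the error.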

📘 Documentation

The full documentation can be found on this link.

How does Qolmat work?

Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:

  1. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator.
  2. For each fold and each compared imputation method, Qolmat fills both the missing and the masked values, then computes each of the default or user-specified performance metrics.
  3. For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.

This is very similar in spirit to the cross_val_score function in scikit-learn.
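The three steps can be sketched with the standard library alone. This is an illustrative toy, not Qolmat's implementation: the helper names (impute_mean, impute_locf, compare_imputers), the uniform random masking, and the use of MAE averaged over folds are all assumptions made for the example.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """Carry the last observation forward; leading gaps take the first observed value."""
    last = next(v for v in values if v is not None)
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def compare_imputers(values, imputers, n_folds=4, ratio_masked=0.1, seed=0):
    """Return, per imputer, the MAE on artificially masked values averaged over folds."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    n_masked = max(1, int(ratio_masked * len(observed_idx)))
    errors = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: hide a random subset of the observed values.
        masked_idx = rng.sample(observed_idx, n_masked)
        corrupted = list(values)
        for i in masked_idx:
            corrupted[i] = None
        # Step 2: impute everything, then score only the positions we hid ourselves.
        for name, impute in imputers.items():
            filled = impute(corrupted)
            errors[name].append(mean(abs(filled[i] - values[i]) for i in masked_idx))
    # Step 3: pool the per-fold scores into a single value per imputer.
    return {name: mean(fold_errors) for name, fold_errors in errors.items()}


# A linear ramp with a few genuinely missing entries.
series = [float(i) for i in range(50)]
for i in (5, 6, 20, 33):
    series[i] = None

scores = compare_imputers(series, {"mean": impute_mean, "locf": impute_locf})
print(scores)
```

Because the true values at the masked positions are known, each imputer can be scored without any external ground truth, which is what makes the procedure comparable to cross-validation.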

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png

Imputation methods

The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the Imputation article on Wikipedia.

| Method | Description | Tabular or Time series | Single or Multiple |
|---|---|---|---|
| mean | Imputes the missing values using the mean along each column | tabular | single |
| median | Imputes the missing values using the median along each column | tabular | single |
| LOCF | Imputes missing entries by carrying the last observation forward for each column | time series | single |
| shuffle | Imputes missing entries with a random value from each column | tabular | multiple |
| interpolation | Imputes missing values using one of the interpolation strategies supported by pd.Series.interpolate | time series | single |
| impute on residuals | The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised | time series | single |
| MICE | Multiple Imputation by Chained Equations | tabular | both |
| RPCA | Robust Principal Component Analysis | both | single |
| SoftImpute | Iterative method for matrix completion that uses nuclear-norm regularization | tabular | single |
| KNN | K-nearest neighbors | tabular | single |
| EM sampler | Imputes missing values via the EM algorithm | both | both |
| MLP | Imputer based on a multi-layer perceptron model | both | both |
| Autoencoder | Imputer based on an autoencoder model with a variational method | both | both |
| TabDDPM | Imputer based on Denoising Diffusion Probabilistic Models | both | both |
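The difference between single and multiple imputation can be illustrated with the three simplest tabular methods from the table. The sketch below uses only the standard library; the function names are hypothetical and do not correspond to Qolmat classes. Mean and median imputation are deterministic, whereas shuffle draws from the observed values and can return a different completion on each call.

```python
import random
from statistics import mean, median


def impute_constant(values, statistic):
    """Single imputation: fill every gap with one statistic of the observed values."""
    fill = statistic([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_shuffle(values, rng):
    """Stochastic imputation: fill each gap with a randomly drawn observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


column = [1.0, None, 3.0, 8.0, None, 4.0]

print(impute_constant(column, mean))    # [1.0, 4.0, 3.0, 8.0, 4.0, 4.0]
print(impute_constant(column, median))  # [1.0, 3.5, 3.0, 8.0, 3.5, 4.0]

# Several draws give several plausible completions of the same column.
draws = [impute_shuffle(column, random.Random(seed)) for seed in range(3)]
for draw in draws:
    print(draw)
```

A deterministic fill tends to minimise pointwise error but shrinks the variance of the column, while random draws preserve the marginal distribution at the cost of pointwise accuracy. This is why the two families are usually judged with different metrics.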

πŸ“ Contributing

You are welcome to propose and contribute new ideas. We encourage you to open an issue so that we can align on the work to be done. It is generally a good idea to have a quick discussion before opening a pull request that is potentially out-of-scope. For more information on the contribution process, please go here.

🤝 Affiliation

Qolmat has been developed by Quantmetry.

πŸ” References

[1] CandΓ¨s, Emmanuel J., et al. β€œRobust principal component analysis?.” Journal of the ACM (JACM) 58.3 (2011): 1-37, (pdf)

[2] Wang, Xuehui, et al. β€œAn improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of advanced transportation 2018 (2018). (pdf)

[3] Chen, Yuxin, et al. β€œBridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (pdf)

[4] Shahid, Nauman, et al. β€œFast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756. (pdf)

[5] Jiashi Feng, et al. β€œOnline robust pca via stochastic optimization.β€œ Advances in neural information processing systems, 26, 2013. (pdf)

[6] GarcΓ­a, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015. (pdf)

πŸ“ License

Qolmat is free and open-source software licensed under the BSD 3-Clause license.
