

Repository Details

A framework for prototyping and benchmarking imputation methods

HyperImpute - A library for NaNs and nulls.

License: MIT. Python 3.7+.

HyperImpute simplifies selecting a data imputation algorithm for your ML pipelines. It includes several novel algorithms for missing data and is compatible with sklearn.

HyperImpute features

  • 🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.
  • 🔑 New iterative imputation method: HyperImpute.
  • 🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
  • 🔥 Pluggable architecture.

🚀 Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

💥 Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)
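
After imputing, a quick sanity check can confirm that no missing values remain and that the observed entries were left untouched. The sketch below is not part of HyperImpute: check_imputation is a hypothetical helper, and a pandas column-mean fill stands in for plugin.fit_transform so that the snippet runs without the library installed.

```python
import numpy as np
import pandas as pd


def check_imputation(original: pd.DataFrame, imputed) -> dict:
    """Summarise an imputation result (hypothetical helper, not a HyperImpute API)."""
    orig = original.to_numpy(dtype=float)
    imp = np.asarray(imputed, dtype=float)
    observed = ~np.isnan(orig)  # True where the input already had a value
    return {
        "remaining_nans": int(np.isnan(imp).sum()),
        "observed_unchanged": bool(np.allclose(orig[observed], imp[observed])),
        "n_imputed": int((~observed).sum()),
    }


X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Assumption: stand-in for `plugin.fit_transform(X.copy())`.
# Each NaN is replaced by the mean of the observed values in its column.
out = X.fillna(X.mean())

report = check_imputation(X, out)
print(report)
```

The same helper could be called on the output of any imputer that returns an array or DataFrame with the same shape as its input.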

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)
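
To make the idea behind such a benchmark concrete, here is a minimal NumPy sketch of one common evaluation loop: hide a fraction of known values, impute them, and score the imputer on the hidden entries only. It is an illustration under stated assumptions, not how compare_models is implemented. The helper names (mask_mcar, mean_impute, rmse_on_hidden) are hypothetical, the data is synthetic, the masking is MCAR (missing completely at random) rather than the MAR scenario requested above, and RMSE is assumed as the metric.

```python
import numpy as np


def mask_mcar(X: np.ndarray, miss_pct: float, rng: np.random.Generator):
    """Hide roughly `miss_pct` of the entries completely at random (MCAR)."""
    mask = rng.random(X.shape) < miss_pct  # True marks a hidden entry
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing: np.ndarray) -> np.ndarray:
    """Baseline imputer: replace NaNs with the mean of the observed column values."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def rmse_on_hidden(X_true: np.ndarray, X_imputed: np.ndarray, mask: np.ndarray) -> float:
    """Root-mean-square error restricted to the entries that were hidden."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))  # synthetic complete dataset (assumption)

scores = {}
for miss_pct in (0.1, 0.3):
    X_missing, mask = mask_mcar(X_true, miss_pct, rng)
    X_imputed = mean_impute(X_missing)
    scores[miss_pct] = rmse_on_hidden(X_true, X_imputed, mask)

print(scores)
```

Scoring only the hidden entries matters: observed values pass through unchanged, so including them would make every imputer look better than it is.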

⚡ Imputation methods

The following table contains the default imputation plugins:

| Strategy | Description | Code |
| --- | --- | --- |
| HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | plugin_hyperimpute.py |
| Mean | Replace the missing values using the mean along each column with SimpleImputer | plugin_mean.py |
| Median | Replace the missing values using the median along each column with SimpleImputer | plugin_median.py |
| Most-frequent | Replace the missing values using the most frequent value along each column with SimpleImputer | plugin_most_freq.py |
| MissForest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor | plugin_missforest.py |
| ICE | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge | plugin_ice.py |
| MICE | Multiple imputations based on ICE using IterativeImputer and BayesianRidge | plugin_mice.py |
| SoftImpute | Low-rank matrix approximation via nuclear-norm regularization | plugin_softimpute.py |
| EM | EM imputation algorithm: an iterative procedure that uses other variables to impute a value (Expectation), then checks whether that value is the most likely one (Maximization) | plugin_em.py |
| Sinkhorn | Missing Data Imputation using Optimal Transport | plugin_sinkhorn.py |
| GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets | plugin_gain.py |
| MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | plugin_miracle.py |
| MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data | plugin_miwae.py |
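
Several rows in the table name the scikit-learn building blocks they rely on. The sketch below uses those scikit-learn classes directly to show what each of these strategies does on the toy dataset from the usage examples. The estimator settings (max_iter, n_estimators, random_state) and the dictionary keys are assumptions chosen for illustration; the HyperImpute plugins may configure them differently.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to import IterativeImputer)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Keys are illustrative labels, not HyperImpute plugin names.
sketches = {
    # Column-wise statistics.
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    # Iterative imputation: each column is regressed on the others in turn.
    "ice_like": IterativeImputer(
        estimator=BayesianRidge(), max_iter=10, random_state=0
    ),
    "missforest_like": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        max_iter=10,
        random_state=0,
    ),
}

# Each imputer returns a NumPy array with the NaNs filled in.
results = {name: imputer.fit_transform(X) for name, imputer in sketches.items()}

for name, values in results.items():
    # Show the two values imputed in the second row.
    print(name, values[1, 2:])
```

On a dataset this small the iterative imputers may emit convergence warnings; the point here is only to show how the pieces named in the table fit together.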

🔨 Tests

Install the testing dependencies using

pip install ".[testing]"

The tests can be executed using

pytest -vsx

Citing

If you use this code, please cite the associated paper:

@article{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle={39th International Conference on Machine Learning},
}
