• Stars: 172
• Rank: 214,725 (Top 5%)
• Language: Python
• License: Other
• Created: about 2 years ago
• Updated: 9 months ago


Repository Details

A power-full Shapley feature selection method.

PowerShap logo

powershap is a feature selection method that uses statistical hypothesis testing and power calculations on Shapley values, enabling fast and intuitive wrapper-based feature selection.

Installation ⚙️

pip install powershap

Usage 🛠

powershap is built to be intuitive; it supports various models, including linear, tree-based, and even deep learning models, for both classification and regression tasks.

from powershap import PowerShap
from catboost import CatBoostClassifier

X, y = ...  # your classification dataset

selector = PowerShap(
    model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True)
)

selector.fit(X, y)  # Fit the PowerShap feature selector
selector.transform(X)  # Reduce the dataset to the selected features

Features ✨

  • default automatic mode
  • scikit-learn compatible
  • supports various models
  • insights into the feature selection method: inspect the ._processed_shaps_df attribute of a fitted PowerShap feature selector (see the sketch after this list).
  • tested code!
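
Because the selector follows scikit-learn's fit/transform conventions, it should also drop into a standard Pipeline. The sketch below is illustrative rather than official documentation: the CatBoost settings mirror the usage example above, while the make_classification data and the downstream LogisticRegression are arbitrary choices made for the example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
from powershap import PowerShap

# Toy classification dataset (only a handful of the 20 features are informative)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

pipe = Pipeline([
    # Feature selection step: keeps only the features flagged by powershap
    ("selection", PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0))),
    # Downstream estimator trained on the selected features only
    ("classification", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)

# Per-feature results of the fitted selector (exact columns may vary)
print(pipe.named_steps["selection"]._processed_shaps_df)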

Benchmarks ⏱

Check out our benchmark results here.

How does it work ⁉️

Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature.

  • Powershap trains multiple models with different random seeds on different subsets of the data. In each iteration, it adds a uniformly random feature to the dataset before training.
  • After training a model in a given iteration, powershap calculates the absolute Shapley values of all features, including the random feature. If there are multiple outputs or multiple classes, powershap takes the maximum across them. These values are then averaged per feature, representing that feature's impact in this iteration.
  • After all iterations, each feature has an array of impacts. Each feature's impact array is then compared to the average of the random feature's impact array using the percentile formula, yielding a p-value. This tests whether the feature has a larger impact than the random feature; if so, the p-value is low (a sketch of this test follows this list).
  • Powershap then outputs all features with a p-value below the provided threshold, which defaults to 0.01.
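
The per-feature test can be sketched in a few lines. The helper below is a hypothetical illustration of the percentile idea, not powershap's internal implementation: each feature is assumed to have one averaged absolute Shapley impact per iteration, and its impacts are compared against the mean impact of the random feature.

import numpy as np

def percentile_p_value(feature_impacts, random_impacts):
    # Hypothetical sketch of the percentile-based test, not the powershap internals.
    random_mean = np.mean(random_impacts)
    # Fraction of iterations in which the feature's impact does NOT exceed the
    # average random impact; informative features yield a low value.
    return float(np.mean(np.asarray(feature_impacts) <= random_mean))

# An informative feature beats the random baseline in 9 of 10 iterations
informative = [0.80, 0.90, 0.70, 1.10, 0.60, 0.90, 1.00, 0.85, 0.95, 0.40]
random_feat = [0.50, 0.45, 0.55, 0.50, 0.48, 0.52, 0.50, 0.49, 0.51, 0.50]
print(percentile_p_value(informative, random_feat))  # 0.1 for this toy example; more iterations drive it lower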

Automatic mode 🤖

The required number of iterations and the threshold value are hyperparameters of powershap. However, to avoid manual hyperparameter tuning, powershap by default uses an automatic mode that determines these hyperparameters itself.

  • The automatic mode first executes powershap with ten iterations.
  • Then, for each feature, powershap calculates the effect size and the statistical power of the test using a Student's t power test.
  • Using the calculated effect size, powershap then calculates the required iterations to achieve a predefined power requirement. By default this is 0.99, which corresponds to a 0.01 probability of missing a truly informative feature.
  • If the required number of iterations exceeds the number already performed, powershap runs the extra iterations.
  • Afterward, powershap recalculates the required iterations and keeps re-executing until the requirement is met (an illustrative power calculation follows this list).
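
As a rough illustration of this power calculation, the snippet below uses statsmodels' one-sample t-test power solver to find the number of iterations needed for a given effect size. The effect size value here is made up for the example; powershap estimates it per feature from the first ten iterations.

import math
from statsmodels.stats.power import TTestPower

effect_size = 0.8     # illustrative value; powershap estimates this per feature
alpha = 0.01          # default p-value threshold
power = 0.99          # default power requirement

# Number of iterations (observations) needed for a one-sided t-test at this power
required_n = TTestPower().solve_power(effect_size=effect_size, alpha=alpha,
                                      power=power, alternative="larger")
required_iterations = math.ceil(required_n)
print(required_iterations)  # if this exceeds the iterations already run, more are executed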

Referencing our package 📝

If you use powershap in a scientific publication, we would highly appreciate it if you cited us as:

@InProceedings{10.1007/978-3-031-26387-3_5,
author="Verhaeghe, Jarne
and Van Der Donckt, Jeroen
and Ongenae, Femke
and Van Hoecke, Sofie",
title="Powershap: A Power-Full Shapley Feature Selection Method",
booktitle="Machine Learning and Knowledge Discovery in Databases",
year="2023",
publisher="Springer International Publishing",
address="Cham",
pages="71--87",
isbn="978-3-031-26387-3"
}

The paper was presented at ECML PKDD 2022. The manuscript can be found here and on GitHub.


👀 Jarne Verhaeghe, Jeroen Van Der Donckt

License

This package is available under the MIT license. More information can be found here: https://github.com/predict-idlab/powershap/blob/main/LICENSE

More Repositories

1. plotly-resampler (Python, 945 stars): Visualize large time series data with plotly.py
2. tsflex (Python, 351 stars): Flexible time series feature extraction & processing
3. tsdownsample (Jupyter Notebook, 121 stars): High-performance time series downsampling algorithms for visualization
4. RR-GCN (Python, 36 stars): Code for "R-GCN: The R Could Stand for Random"
5. sleep-linear (Jupyter Notebook, 35 stars): Do not sleep on traditional machine learning for sleep stage scoring
6. seriesdistancematrix (Python, 27 stars)
7. ts-datapoint-selection-vis (HTML, 13 stars): Data Point Selection for Line Chart Visualization: analysis notebooks and implementation details
8. trace-updater (JavaScript, 11 stars): Dash component to update a dcc.Graph its traces via callbacks
9. MinMaxLTTB (HTML, 8 stars): MinMax-preselection for Efficient Time Series Line Chart Visualization (using LTTB)
10. causalteshap (Jupyter Notebook, 7 stars)
11. tsflex-benchmarking (HTML, 6 stars)
12. VisCARS (Jupyter Notebook, 5 stars): VisCARS: Graph-Based Context-Aware Visualization Recommendation System
13. cmc-learner (Jupyter Notebook, 5 stars): Implementation of Conformal Monte Carlo (CMC) learner
14. class-balancing-paper (Jupyter Notebook, 4 stars)
15. The-Distribution-Coverage-Loss (Jupyter Notebook, 3 stars)
16. ddashboard-ontology (3 stars): Ontology for the Dynamic Dashboard
17. REACT (Jupyter Notebook, 3 stars)
18. landmarker (Python, 3 stars): PyTorch-based toolkit for landmark detection
19. svd-kernels (Jupyter Notebook, 3 stars): Repository for code regarding the paper "Parameter-efficient neural networks with singular value decomposed kernels"
20. gssp_analysis (Jupyter Notebook, 2 stars): Analysis notebooks and scripts of the gssp web app data collection
21. DAHCC-Sources (Python, 2 stars): Resource files for all ontologies described at https://dahcc.idlab.ugent.be
22. gssp_web_app (HTML, 2 stars): Web application to acquire picture description speech data according to the GSSP
23. plotly-resampler-benchmarks (Jupyter Notebook, 2 stars)
24. obelisk-python (Python, 2 stars): Python client for the Obelisk API
25. data-quality-challenges-wearables (Jupyter Notebook, 2 stars): Addressing Data Quality Challenges in Observational Ambulatory Studies: Analysis, methodologies and practical solutions for wrist-worn wearable monitoring
26. webthing-client-python (Python, 1 star): Python client for Dynamic Dashboard Webthings
27. atrial_fibrillation_prediction (Jupyter Notebook, 1 star)
28. phys-ml-leak-localization (Jupyter Notebook, 1 star)