• Stars
    star
    2,405
  • Rank 19,133 (Top 0.4 %)
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created almost 9 years ago
  • Updated about 2 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods

Downloads Downloads Test Suite and Linting DOI

A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques.

Important Links

Documentation: http://contrib.scikit-learn.org/category_encoders/

Encoding Methods

Unsupervised:

  • Backward Difference Contrast [2][3]
  • BaseN [6]
  • Binary [5]
  • Gray [14]
  • Count [10]
  • Hashing [1]
  • Helmert Contrast [2][3]
  • Ordinal [2][3]
  • One-Hot [2][3]
  • Rank Hot [15]
  • Polynomial Contrast [2][3]
  • Sum Contrast [2][3]

Supervised:

  • CatBoost [11]
  • Generalized Linear Mixed Model [12]
  • James-Stein Estimator [9]
  • LeaveOneOut [4]
  • M-estimator [7]
  • Target Encoding [7]
  • Weight of Evidence [8]
  • Quantile Encoder [13]
  • Summary Encoder [13]

Installation

The package requires: numpy, statsmodels, and scipy.

To install the package, execute:

$ python setup.py install

or

pip install category_encoders

or

conda install -c conda-forge category_encoders

To install the development version, you may use:

pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders

Usage

All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

Examples

There are two types of encoders: unsupervised and supervised. An unsupervised example:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)

And a supervised example:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD'])

# transform the datasets
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)

For the transformation of the training data with the supervised methods, you should use fit_transform() method instead of fit().transform(), because these two methods do not have to generate the same result. The difference can be observed with LeaveOneOut encoder, which performs a nested cross-validation for the training data in fit_transform() method (to decrease over-fitting of the downstream model) but uses all the training data for scoring with transform() method (to get as accurate estimates as possible).

Furthermore, you may benefit from following wrappers:

  • PolynomialWrapper, which extends supervised encoders to support polynomial targets
  • NestedCVWrapper, which helps to prevent overfitting

Additional examples and benchmarks can be found in the examples directory.

Contributing

Category encoders is under active development, if you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file or open an issue on the github project to get started.

References

  1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
  2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
  3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
  4. Owen Zhang - Leave One Out Encoding. From https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding
  5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
  6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
  7. Daniele Miccii-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
  8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
  9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
  10. Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies
  11. Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
  12. Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
  13. Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. Modeling Decisions for Artificial Intelligence, 2021. Springer International Publishing https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14
  14. Gray Encoding. From https://en.wikipedia.org/wiki/Gray_code
  15. Jacob Buckman, Aurko Roy, Colin Raffel, Ian Goodfellow: Thermometer Encoding: One Hot Way To Resist Adversarial Examples. From https://openreview.net/forum?id=S18Su--CW
  16. Fairness implications of encoding protected categorical attributes. Carlos Mougan, Jose Alvarez, Salvatore Ruggieri, and Steffen Staab. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, https://arxiv.org/abs/2201.11358

More Repositories

1

imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Python
6,549
star
2

sklearn-pandas

Pandas integration with sklearn
Python
2,803
star
3

hdbscan

A high performance implementation of HDBSCAN clustering.
Jupyter Notebook
2,795
star
4

lightning

Large-scale linear classification, regression and ranking in Python
Python
1,716
star
5

boruta_py

Python implementations of the Boruta all-relevant feature selection method.
Python
1,474
star
6

metric-learn

Metric learning algorithms in Python
Python
1,346
star
7

MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
Jupyter Notebook
1,285
star
8

skope-rules

machine learning with logical rules in Python
Jupyter Notebook
541
star
9

DESlib

A Python library for dynamic classifier and ensemble selection
Python
479
star
10

py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
Python
444
star
11

scikit-learn-contrib

scikit-learn compatible projects
400
star
12

project-template

A template for scikit-learn extensions
Python
316
star
13

forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms
HTML
282
star
14

polylearn

A library for factorization machines and polynomial networks for classification and regression in Python.
Python
245
star
15

stability-selection

scikit-learn compatible implementation of stability selection.
Python
195
star
16

skglm

Fast and modular sklearn replacement for generalized linear models
Python
157
star
17

scikit-learn-extra

scikit-learn contrib estimators
Python
155
star
18

qolmat

A scikit-learn-compatible module for comparing imputation methods.
Python
134
star
19

hiclass

A python library for hierarchical classification compatible with scikit-learn
Python
113
star
20

scikit-dimension

A Python package for intrinsic dimension estimation
Python
78
star
21

scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities
Python
76
star
22

skdag

A more flexible alternative to scikit-learn Pipelines
Python
29
star
23

denmune-clustering-algorithm

DenMune a clustering algorithm that can find clusters of arbitrary size, shapes and densities in two-dimensions. Higher dimensions are first reduced to 2-D using the t-sne. The algorithm relies on a single parameter K (the number of nearest neighbors). The results show the superiority of DenMune. Enjoy the simplicty but the power of DenMune.
Jupyter Notebook
29
star
24

mimic

mimic calibration
Python
21
star
25

sklearn-ann

Integration with (approximate) nearest neighbors libraries for scikit-learn + clustering based on with kNN-graphs.
Python
14
star
26

scikit-learn-contrib.github.io

Project webpage
HTML
4
star