Feature Engine
Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Its transformers follow the Scikit-learn API: fit() learns the transformation parameters from the data, and transform() applies them to the data.
Feature-engine features in the following resources
Blogs about Feature-engine
- Feature-engine: A new open-source Python package for feature engineering
- Practical Code Implementations of Feature Engineering for Machine Learning with Python
Documentation
Feature-engine's current transformers include functionality for:
- Missing Data Imputation
- Categorical Encoding
- Discretisation
- Outlier Capping or Removal
- Variable Transformation
- Variable Creation
- Variable Selection
- Datetime Features
- Time Series
- Preprocessing
- Scikit-learn Wrappers
Imputation Methods
- MeanMedianImputer
- RandomSampleImputer
- EndTailImputer
- AddMissingIndicator
- CategoricalImputer
- ArbitraryNumberImputer
- DropMissingData
Encoding Methods
- OneHotEncoder
- OrdinalEncoder
- CountFrequencyEncoder
- MeanEncoder
- WoEEncoder
- RareLabelEncoder
- DecisionTreeEncoder
- StringSimilarityEncoder
Discretisation Methods
- EqualFrequencyDiscretiser
- EqualWidthDiscretiser
- GeometricWidthDiscretiser
- DecisionTreeDiscretiser
- ArbitraryDiscretiser
Outlier Handling Methods
- Winsorizer
- ArbitraryOutlierCapper
- OutlierTrimmer
Variable Transformation Methods
- LogTransformer
- LogCpTransformer
- ReciprocalTransformer
- ArcsinTransformer
- PowerTransformer
- BoxCoxTransformer
- YeoJohnsonTransformer
Variable Creation
- MathFeatures
- RelativeFeatures
- CyclicalFeatures
Feature Selection
- DropFeatures
- DropConstantFeatures
- DropDuplicateFeatures
- DropCorrelatedFeatures
- SmartCorrelatedSelection
- SelectByShuffling
- SelectBySingleFeaturePerformance
- SelectByTargetMeanPerformance
- RecursiveFeatureElimination
- RecursiveFeatureAddition
- DropHighPSIFeatures
- SelectByInformationValue
- ProbeFeatureSelection
Datetime
- DatetimeFeatures
- DatetimeSubtraction
Time Series
- LagFeatures
- WindowFeatures
- ExpandingWindowFeatures
Preprocessing
- MatchCategories
- MatchVariables
Wrappers
- SklearnTransformerWrapper
Installation
From PyPI using pip:
pip install feature_engine
From Anaconda:
conda install -c conda-forge feature_engine
Or simply clone it:
git clone https://github.com/feature-engine/feature_engine.git
Example Usage
>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder
>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
Out[1]:
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
Out[2]:
A       10
B       10
Rare     3
Name: var_A, dtype: int64
Find more examples in our Jupyter Notebook Gallery or in the documentation.
Contribute
Details about how to contribute can be found on the Contribute page.
Briefly:
- Fork the repo
- Clone your fork to your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
- Navigate into the repo folder:
cd feature_engine
- Install Feature-engine as a developer:
pip install -e .
- Optional: Create and activate a virtual environment with any tool of choice
- Install Feature-engine dependencies:
pip install -r requirements.txt
and
pip install -r test_requirements.txt
- Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
- Develop your feature, tests and documentation
- Make sure the tests pass
- Make a PR
Thank you!!
Documentation
Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.
To build the documentation, first make sure the dependencies are installed. From the root directory, run: pip install -r docs/requirements.txt
Now you can build the docs using: sphinx-build -b html docs build
License
BSD 3-Clause
Sponsor us
Sponsor us and further support our mission to democratize machine learning and programming tools through open-source software.