Feature Engineering for Machine Learning - Code Repository
Code repository for the online course Feature Engineering for Machine Learning
Published November 2017.
Actively maintained.
Table of Contents
- Introduction: Variable Types
  - Numerical Variables: Discrete and continuous
  - Categorical Variables: Nominal and Ordinal
  - Datetime variables
  - Mixed variables: strings and numbers
- Variable Characteristics
  - Missing Data
  - Cardinality
  - Category Frequency
  - Distributions
  - Outliers
  - Magnitude
- Missing Data Imputation
  - Mean and Median Imputation
  - Arbitrary value imputation
  - End of Tail Imputation
  - Frequent category imputation
  - Adding a "Missing" string category
  - Random Sample Imputation
  - Adding a missing indicator
  - Imputation with Scikit-learn
  - Imputation with Feature-engine
- Multivariate Imputation
  - MICE
  - KNN imputation
- Categorical Variable Encoding
  - One hot encoding: simple and of frequent categories
  - Ordinal encoding: arbitrary and ordered
  - Target mean encoding
  - Weight of evidence
  - Rare Label encoding
  - Encoding with Scikit-learn
  - Encoding with Feature-engine
  - Encoding with Category Encoders
- Variable Transformation
  - Log, power and reciprocal
  - Box-Cox
  - Yeo-Johnson
  - Transformation with Scikit-learn
  - Transformation with Feature-engine
- Discretisation
  - Arbitrary
  - Equal-frequency discretisation
  - Equal-width discretisation
  - K-means discretisation
  - Discretisation with trees
  - Discretisation with Scikit-learn
  - Discretisation with Feature-engine
- Outliers
  - Capping
  - Trimming
- Datetime
  - Extracting day, month, week, etc.
  - Extracting hour, minute, second, etc.
  - Capturing elapsed time
  - Working with timezones
- Mixed variables
  - Creating new variables from strings and numbers
- Feature creation
  - Sum, product, count, mean, standard deviation, etc.
  - Division and subtraction
  - Polynomial expansion
  - Splines
- Feature Scaling
  - Standardisation
  - Min-max scaling
  - Maximum absolute scaling
  - Robust scaling
- Pipelines
  - Classification Pipeline
  - Regression Pipeline
  - Pipeline with cross-validation
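
To give a flavour of a few of the topics listed above (median imputation with a missing indicator, equal-width discretisation, and standardisation), here is a minimal sketch using only the Python standard library. It is not taken from the course notebooks: the function names and the toy `ages` data are hypothetical, chosen purely for illustration, and the code assumes a single numerical variable with at least two distinct observed values.

```python
from statistics import mean, median, pstdev


def impute_median_with_indicator(values):
    """Replace None with the median of the observed values.

    Also returns a 0/1 list flagging which entries were missing.
    (Hypothetical helper, for illustration only.)
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width.

    Assumes max(values) > min(values); the maximum is clamped into
    the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def standardise(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy data (made up for this example): ages with two missing entries.
ages = [22, None, 35, 58, None, 41]

imputed, was_missing = impute_median_with_indicator(ages)
bins = equal_width_bins(imputed, n_bins=3)
scaled = standardise(imputed)

print(imputed)
print(was_missing)
print(bins)
```

In a real project the median, the bin edges, and the scaling parameters would be learned on the training set only and then applied to new data, which is the pattern that the `fit` and `transform` methods of Scikit-learn and Feature-engine transformers follow.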