• Stars: 370
• Rank: 115,405 (Top 3%)
• Language: Jupyter Notebook
• License: Other
• Created over 5 years ago
• Updated 12 months ago

Repository Details

Code repository for the online course Feature Engineering for Machine Learning

License: https://github.com/solegalli/feature-engineering-for-machine-learning/blob/master/LICENSE
Sponsorship: https://www.trainindata.com/

Feature Engineering for Machine Learning - Code Repository

Published November 2017

Actively maintained.

Table of Contents

  1. Introduction: Variable Types

    1. Numerical Variables: Discrete and continuous
    2. Categorical Variables: Nominal and Ordinal
    3. Datetime variables
    4. Mixed variables: strings and numbers
  2. Variable Characteristics

    1. Missing Data
    2. Cardinality
    3. Category Frequency
    4. Distributions
    5. Outliers
    6. Magnitude
  3. Missing Data Imputation

    1. Mean and Median Imputation
    2. Arbitrary value imputation
    3. End of Tail Imputation
    4. Frequent category imputation
    5. Adding a "Missing" string category
    6. Random Sample Imputation
    7. Adding a missing indicator
    8. Imputation with Scikit-learn
    9. Imputation with Feature-engine
  4. Multivariate Imputation

    1. MICE
    2. KNN imputation
  5. Categorical Variable Encoding

    1. One hot encoding: simple and of frequent categories
    2. Ordinal encoding: arbitrary and ordered
    3. Target mean encoding
    4. Weight of evidence
    5. Rare Label encoding
    6. Encoding with Scikit-learn
    7. Encoding with Feature-engine
    8. Encoding with category encoders
  6. Variable Transformation

    1. Log, power and reciprocal
    2. Box-Cox
    3. Yeo-Johnson
    4. Transformation with Scikit-learn
    5. Transformation with Feature-engine
  7. Discretisation

    1. Arbitrary
    2. Equal-frequency discretisation
    3. Equal-width discretisation
    4. K-means discretisation
    5. Discretisation with trees
    6. Discretisation with Scikit-learn
    7. Discretisation with Feature-engine
  8. Outliers

    1. Capping
    2. Trimming
  9. Datetime

    1. Extracting day, month, week, etc
    2. Extracting hr, min, sec, etc
    3. Capturing elapsed time
    4. Working with timezones
  10. Mixed variables

    1. Creating new variables from strings and numbers
  11. Feature creation

    1. Sum, prod, count, mean, std, etc
    2. Div, sub
    3. Polynomial expansion
    4. Splines
  12. Feature Scaling

    1. Standardisation
    2. MinMaxScaling
    3. MaxAbsoluteScaling
    4. RobustScaling
  13. Pipelines

    1. Classification Pipeline
    2. Regression Pipeline
    3. Pipeline with cross-validation
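
As a rough illustration of how a few of the topics listed above fit together (median imputation, a missing indicator, and standardisation), here is a minimal sketch. It is not taken from the course notebooks, which work with pandas, Scikit-learn and Feature-engine; it uses only the Python standard library, and the function names and the toy `train_age` / `test_age` data are made up for this example. The point it tries to show is that parameters (the median, the mean and the standard deviation) are learned from the training data and then reused on new data.

```python
from statistics import mean, median, pstdev


def fit_median_imputer(values):
    """Learn the median from the observed (non-None) training values."""
    return median(v for v in values if v is not None)


def impute_with_indicator(values, fill_value):
    """Replace None with fill_value and return a 0/1 missing indicator."""
    imputed = [fill_value if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def fit_standardiser(values):
    """Learn the mean and (population) standard deviation."""
    return mean(values), pstdev(values)


def standardise(values, mu, sigma):
    """Centre on mu and scale by sigma."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: None marks a missing value.
train_age = [22, None, 35, 58, None, 41]
test_age = [30, None]

# Learn every parameter from the training set only.
fill = fit_median_imputer(train_age)
train_imputed, train_missing = impute_with_indicator(train_age, fill)
mu, sigma = fit_standardiser(train_imputed)

# Apply the learned parameters to both sets.
train_scaled = standardise(train_imputed, mu, sigma)
test_imputed, test_missing = impute_with_indicator(test_age, fill)
test_scaled = standardise(test_imputed, mu, sigma)

print(fill, train_missing, test_imputed)
```

In practice the same fit-then-transform pattern is what a Scikit-learn pipeline automates, so this sketch is only meant to make the idea concrete, not to replace those tools.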

Links