• Stars
    star
    596
  • Rank 75,095 (Top 2 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 6 years ago
  • Updated 12 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Code to compute permutation and drop-column importances in Python scikit-learn models

Feature importances for scikit-learn machine learning models

By Terence Parr and Kerem Turgutlu. See Explained.ai for more stuff.

The scikit-learn Random Forest feature importances strategy is mean decrease in impurity (or gini importance) mechanism, which is unreliable. To get reliable results, use permutation importance, provided in the rfpimp package in the src dir. Install with:

pip install rfpimp

We include permutation and drop-column importance measures that work with any sklearn model. Yes, rfpimp is an increasingly-ill-suited name, but we still like it.

Description

See Beware Default Random Forest Importances for a deeper discussion of the issues surrounding feature importances in random forests (authored by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl et al pointed out in Bias in random forest variable importance measures: Illustrations, sources and a solution that β€œthe variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”

A more reliable method is permutation importance, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R2 score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature and then pass all test samples back through the random forest and recompute the accuracy or R2. The importance of that feature is the difference between the baseline and the drop in overall accuracy or R2 caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.

Sample code

See the notebooks directory for things like Collinear features and Plotting feature importances.

Here's some sample Python code that uses the rfpimp package contained in the src directory. The data can be found in rent.csv, which is a subset of the data from Kaggle's Two Sigma Connect: Rental Listing Inquiries competition.

from rfpimp import *
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df_orig = pd.read_csv("/Users/parrt/github/random-forest-importances/notebooks/data/rent.csv")

df = df_orig.copy()

# attentuate affect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms','bedrooms','longitude','latitude',
            'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price',axis=1), df_train['price']
X_test, y_test = df_test.drop('price',axis=1), df_test['price']
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test) # permutation
viz = plot_importances(imp)
viz.view()


df_train, df_test = train_test_split(df_orig, test_size=0.20)
features = ['bathrooms','bedrooms','price','longitude','latitude',
            'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level',axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level',axis=1), df_test['interest_level']
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()

Feature correlation

See Feature collinearity heatmap. We can get the Spearman's correlation matrix:

Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but that only identifies linear relationships. A way to at least identify if a feature, x, is dependent on other features is to train a model using x as a dependent variable and all other features as independent variables. Because random forests give us an easy out of bag error estimate, the feature dependence functions rely on random forest models. The R^2 prediction error from the model indicates how easy it is to predict feature x using the other features. The higher the score, the more dependent feature x is.

You can also get a feature dependence matrix / heatmap that returns a non-symmetric data frame where each row is the importance of each var to the row's var used as a model target. Example:

More Repositories

1

dtreeviz

A python library for decision tree visualization and model interpretation.
Jupyter Notebook
2,921
star
2

lolviz

A simple Python data-structure visualization tool for lists of lists, lists, dictionaries; primarily for use in Jupyter notebooks / presentations
Jupyter Notebook
823
star
3

tensor-sensor

The goal of this library is to generate more helpful exception messages for matrix algebra expressions for numpy, pytorch, jax, tensorflow, keras, fastai.
Jupyter Notebook
746
star
4

bookish

A tool that translates augmented markdown into HTML or latex
Java
449
star
5

msds621

Course notes for MSDS621 at Univ of San Francisco, introduction to machine learning
Jupyter Notebook
346
star
6

simple-virtual-machine

A simple VM for a talk on building VMs
Java
207
star
7

simple-virtual-machine-C

Same as simple-virtual-machine but in C
C
136
star
8

msds692

MSAN692 Data Acquisition
HTML
125
star
9

msds501

Course notes for MSDS501, computational boot camp, at the University of San Francisco
Jupyter Notebook
123
star
10

cs652

University of San Francisco CS652 -- Programming Languages
Java
112
star
11

fundamentals-of-deep-learning

Course notes and notebooks to teach the fundamentals of how deep learning works; uses PyTorch.
Jupyter Notebook
73
star
12

msds689

Course syllabus, notes, projects for USF's MSDS689
Jupyter Notebook
64
star
13

stratx

stratx is a library for A Stratification Approach to Partial Dependence for Codependent Variables
TeX
62
star
14

ml-articles

Articles on machine learning
Jupyter Notebook
61
star
15

cs601

USF CS601 lecture notes and sample code
Java
54
star
16

msds593

MSDS593 -- Exploratory data analysis (EDA) at the University of San Francisco
Jupyter Notebook
25
star
17

website-explained.ai

The website content for explained.ai
Jupyter Notebook
23
star
18

msan501-old

USF MSAN501 lecture notes and sample code
TeX
21
star
19

mini-markdown

Parser for small subset of markdown
Java
20
star
20

cs345

CS345 Programming Languages at University of San Francisco
19
star
21

AniML-java

A Java implementation of random forest machine learning algorithm / classifier
Java
9
star
22

website-mlbook

Public repo to host website for public releases of mlbook html
HTML
8
star
23

bash-git-prompt

My own variation on the bash git prompt
Python
8
star
24

autodx

Simple automatic differentiation via operator overloading for educational purposes
TeX
7
star
25

data-acquisition

Data acquisition certificate (part of http://www.sfdatainstitute.org Course number CAS-DI-DAPY-001.
HTML
7
star
26

parrtlib

Parrt's Java library with useful functions
Java
6
star
27

gmdh

Experiment with GMDH polynomial computation-graph nodes
Python
5
star
28

msan501-starterkit

A starter kit with tests and skeleton code for the computational analytics boot camp, MSAN501, at the University of San Francisco.
Python
5
star
29

bild

A simple build utility written in Python, though I'll use to build java projects.
Python
5
star
30

c_unit

A C unit testing rig in the spirit of junit.
C
4
star
31

sample-jetbrains-plugin

A sample jetbrains plugin that uses ANTLR for lexing/parsing.
Java
4
star
32

java-neural-net

A simple neural network in java using particle swarm optimization.
Java
4
star
33

playdl

Playing with deep learning
Jupyter Notebook
3
star
34

antlr4-demo-simple-lang

Simple language grammar and listener for talk demos
Java
3
star
35

hash-duo

Explore building a hash table with two different hash functions that balances chain length
C++
3
star
36

selfnet

Playing with self-organizing deep learning neural networks
Jupyter Notebook
2
star
37

pltvid

A simple library to capture multiple matplotlib plots as a movie.
Jupyter Notebook
2
star
38

gpu-test

A test of OpenCL use on OS X, XCode. Simple vector squaring.
C
2
star
39

learn-git

1
star
40

gradle-antlr-plugin

The Official Gradle ANTLR plugin
1
star
41

cs601-webmail-skeleton

Some goodies to help start the CS601 webmail project
Java
1
star
42

cs601-webmail-st-skeleton

StringTemplate-based version of webmail skeleon
Java
1
star
43

inclass

1
star
44

foobar

1
star
45

website-book.explained.ai

HTML
1
star
46

demo

test for class
Java
1
star
47

website-faculty-parrt

My faculty web page
HTML
1
star
48

annotation-processor

Java
1
star