• Stars
    star
    486
  • Rank 90,527 (Top 2 %)
  • Language
    Python
  • Created over 8 years ago
  • Updated over 7 years ago

Repository Details

Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black-box models

Highly interpretable, sklearn-compatible classifier based on decision rules

This is a scikit-learn compatible wrapper for the Bayesian Rule List classifier developed by Letham et al., 2015 (see Letham's original code), extended by a minimum description length-based discretizer (Fayyad & Irani, 1993) for continuous data, and by an approach to subsample large datasets for better performance.

It produces rule lists, which makes trained classifiers easily interpretable to human experts, and is competitive with state-of-the-art classifiers such as random forests or SVMs.

For example, an easily understood Rule List model of the well-known Titanic dataset:

IF male AND adult THEN survival probability: 21% (19% - 23%)
ELSE IF 3rd class THEN survival probability: 44% (38% - 51%)
ELSE IF 1st class THEN survival probability: 96% (92% - 99%)
ELSE survival probability: 88% (82% - 94%)
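
A rule list like the one above is read top to bottom: the first rule whose condition matches determines the prediction, and the final ELSE acts as the default. The following is a minimal sketch of that evaluation order only, not the library's internal representation; the dictionary keys (`sex`, `age_group`, `pclass`) and the function name are hypothetical, and the probabilities are copied from the Titanic example.

```python
# Illustrative only: a rule list as ordered (condition, probability) pairs.
# The keys "sex", "age_group" and "pclass" are hypothetical names.
TITANIC_RULES = [
    (lambda p: p["sex"] == "male" and p["age_group"] == "adult", 0.21),
    (lambda p: p["pclass"] == 3, 0.44),
    (lambda p: p["pclass"] == 1, 0.96),
]
DEFAULT_PROBABILITY = 0.88  # the final ELSE branch


def survival_probability(passenger):
    """Return the probability attached to the first matching rule."""
    for condition, probability in TITANIC_RULES:
        if condition(passenger):
            return probability
    return DEFAULT_PROBABILITY


# An adult male in 1st class is caught by the first rule, not the 1st-class rule.
print(survival_probability({"sex": "male", "age_group": "adult", "pclass": 1}))
print(survival_probability({"sex": "female", "age_group": "child", "pclass": 2}))
```

Because order matters, the same passenger can satisfy several conditions, yet only the earliest matching rule is applied.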

Letham et al.'s approach only works on discrete data. However, this approach can still be used on continuous data after discretization. The RuleListClassifier class also includes a discretizer that can deal with continuous data (using Fayyad & Irani's minimum description length principle criterion, based on an implementation by navicto).
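
To give a feel for what minimum description length discretization does, here is a rough pure-Python sketch of the Fayyad & Irani criterion for a single feature. It is not the discretizer shipped with this project; all function names are made up for illustration. The idea is to pick the cut point with the highest information gain, accept it only if the gain outweighs the cost of encoding the cut, and then recurse on both sides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, information_gain) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    n, base = len(pairs), entropy(ys)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if best is None or gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best


def mdlp_accepts(values, labels, cut):
    """MDL test: is the information gain worth the cost of encoding the cut?"""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - len(left) / n * ent1 - len(right) / n * ent2
    delta = log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    return gain > (log2(n - 1) + delta) / n


def mdlp_cut_points(values, labels):
    """Recursively collect the accepted cut points for one feature."""
    found = best_cut(values, labels)
    if found is None or not mdlp_accepts(values, labels, found[0]):
        return []
    cut = found[0]
    low = [(x, y) for x, y in zip(values, labels) if x <= cut]
    high = [(x, y) for x, y in zip(values, labels) if x > cut]
    return mdlp_cut_points(*zip(*low)) + [cut] + mdlp_cut_points(*zip(*high))


# A feature that separates the classes cleanly gets one cut; noise gets none.
print(mdlp_cut_points(list(range(1, 21)), [0] * 10 + [1] * 10))
print(mdlp_cut_points(list(range(1, 9)), [0, 1, 0, 1, 0, 1, 0, 1]))
```

The resulting cut points define intervals, which can then be treated as categorical values by a rule learner that only handles discrete data.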

The inference procedure is slow on large datasets. If you have more than a few thousand data points and only numeric data, try the included BigDataRuleListClassifier(training_subset=0.1). It first determines a small subset of the training data that is most critical in defining a decision boundary (the data points that are hardest to classify) and learns a rule list only on this subset. To choose which estimator judges which points are hardest to classify, pass any sklearn-compatible estimator in the subset_estimator parameter (see examples/diabetes_bigdata_demo.py).
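
One plausible way to pick the "hardest to classify" points is sketched below. This is an assumption for illustration, not necessarily the criterion BigDataRuleListClassifier uses: given the positive-class probabilities produced by some helper estimator (for example, the output of a scikit-learn `predict_proba` call), it ranks points by how little confidence the estimator assigns to their true label and keeps the lowest-ranked fraction. The function name `hardest_subset` and the sample numbers are hypothetical.

```python
def hardest_subset(positive_probs, labels, fraction=0.1):
    """Return indices of the `fraction` of points the estimator handles worst.

    positive_probs: estimated P(y = 1) per point, from any helper estimator.
    labels: true 0/1 labels.
    """
    # Confidence assigned to the *true* class; low values mean "hard to classify".
    confidence = [p if y == 1 else 1.0 - p for p, y in zip(positive_probs, labels)]
    keep = max(1, int(round(fraction * len(labels))))
    ranked = sorted(range(len(labels)), key=lambda i: confidence[i])
    return sorted(ranked[:keep])


# Hypothetical estimator output for ten points.
probs = [0.95, 0.55, 0.10, 0.45, 0.80, 0.30, 0.02, 0.60, 0.99, 0.50]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
print(hardest_subset(probs, labels, fraction=0.3))
```

A rule list would then be trained only on the rows at the returned indices, which is what keeps training time manageable on large datasets.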

Usage

The project requires pyFIM, scikit-learn, and pandas to run.

The included RuleListClassifier works as a scikit-learn estimator, with a model.fit(X,y) method which takes training data X (numpy array or pandas DataFrame; continuous, categorical or mixed data) and labels y.

The learned rules of a trained model can be displayed by converting the object to a string, e.g. print(model), or by calling model.tostring(decimals=1), which optionally takes the rounding precision.

Numerical data in X is automatically discretized. To prevent discretization (e.g. to protect columns containing categorical data represented as integers), pass the list of protected column names to the fit method, e.g. model.fit(X,y,undiscretized_features=['CAT_COLUMN_NAME']). Entries in undiscretized columns are converted to strings and used as categorical values (see examples/hepatitis_mixeddata_demo.py).

Usage example:

from __future__ import print_function  # keeps print() working on Python 2 as well

from RuleListClassifier import *
# note: fetch_mldata was removed in scikit-learn 0.22; newer versions need a different data loader
from sklearn.datasets.mldata import fetch_mldata
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
from sklearn.ensemble import RandomForestClassifier

feature_labels = ["#Pregnant","Glucose concentration test","Blood pressure(mmHg)","Triceps skin fold thickness(mm)","2-Hour serum insulin (mu U/ml)","Body mass index","Diabetes pedigree function","Age (years)"]

data = fetch_mldata("diabetes") # get dataset
y = (data.target+1)/2 # target labels (0 or 1)
Xtrain, Xtest, ytrain, ytest = train_test_split(data.data, y) # split

# train classifier (allow more iterations for better accuracy; use BigDataRuleListClassifier for large datasets)
model = RuleListClassifier(max_iter=10000, class1label="diabetes", verbose=False)
model.fit(Xtrain, ytrain, feature_labels=feature_labels)

print("RuleListClassifier Accuracy:", model.score(Xtest, ytest), "Learned interpretable model:\n", model)
print("RandomForestClassifier Accuracy:", RandomForestClassifier().fit(Xtrain, ytrain).score(Xtest, ytest))
"""
**Output:**
RuleListClassifier Accuracy: 0.776041666667 Learned interpretable model:
Trained RuleListClassifier for detecting diabetes
==================================================
IF Glucose concentration test : 157.5_to_inf THEN probability of diabetes: 81.1% (72.5%-72.5%)
ELSE IF Body mass index : -inf_to_26.3499995 THEN probability of diabetes: 5.2% (1.9%-1.9%)
ELSE IF Glucose concentration test : -inf_to_103.5 THEN probability of diabetes: 14.4% (8.8%-8.8%)
ELSE IF Age (years) : 27.5_to_inf THEN probability of diabetes: 59.6% (51.8%-51.8%)
ELSE IF Glucose concentration test : 103.5_to_127.5 THEN probability of diabetes: 15.9% (8.0%-8.0%)
ELSE probability of diabetes: 44.7% (29.5%-29.5%)
=================================================

RandomForestClassifier Accuracy: 0.729166666667
"""

More Repositories

1

semisup-learn

Semi-supervised learning frameworks for python, which allow fitting scikit-learn classifiers to partially labeled data
Python
503
star
2

highdimensional-decision-boundary-plot

Estimating and plotting the decision boundary (decision surface) of machine learning classifiers in higher dimensions (scikit-learn compatible)
Python
227
star
3

pySeqSLAM

Python SeqSLAM - port of Niko Sünderhauf's OpenSeqSLAM for place recognition
Python
46
star
4

sklearn-interpretable-tree

Simplified tree-based classifier and regressor for interpretable machine learning (scikit-learn compatible)
Python
46
star
5

sklearn-random-rotation-ensembles

Scikit-learn compatible implementations of the Random Rotation Ensemble idea of (Blaser & Fryzlewicz, 2016)
Python
43
star
6

ROS-road-line-junction-extraction

3D road line extraction, bird's-eye view projection, and junction detection for ROS stereo images
Python
34
star
7

linear-SVM-on-top-of-CNN-example

Simple example showing how to use intermediate CNN layer activations as feature vectors for training a linear SVM, to create a custom image classifier
33
star
8

python-LS-SLAM

Least squares SLAM backend for pose graph-based loop closure in Python
Python
32
star
9

sklearn-random-bits-forest

Scikit-learn compatible wrapper of the Random Bits Forest program written by (Wang et al., 2016)
Python
9
star
10

Cognitive-Map-Structure-Experiment

Code for running 3D psychological experiments in the browser, intended to investigate the structure of spatial memory
JavaScript
5
star
11

Paleolithic-Cooperation-Simulation

An agent-based social simulation of a Palaeolithic human society, for investigating how cooperative behaviour can emerge and prevail
C#
3
star
12

phd-thesis

Bayesian mechanisms in spatial cognition: Towards real-world capable computational cognitive models of spatial memory
TeX
2
star
13

mobile-browser-heart-rate

Estimating users' current heart rate on a smartphone from within the browser, using accelerometer sensors
JavaScript
1
star
14

HTSP

Hierarchical convex hull-based heuristic for solving the Traveling Salesman Problem
Java
1
star