  • Stars: 227
  • Rank: 175,900 (Top 4%)
  • Language: Python
  • License: MIT License
  • Created: over 8 years ago
  • Updated: over 1 year ago


Repository Details

Estimating and plotting the decision boundary (decision surface) of machine learning classifiers in higher dimensions (scikit-learn compatible)

Plotting high-dimensional decision boundaries


An experimental, scikit-learn compatible approach to plotting high-dimensional decision boundaries. This facilitates intuitive understanding and helps improve models by allowing visual inspection of misclassified regions, model complexity, and the amount of overfitting/underfitting (augmenting more reliable but less intuitive tools such as training/test error curves and ROC curves), as well as of individual data points and their contribution to the decision surface. Finally, it shows which regions are likely to be misclassified, which are uncertain, and, in applications where active querying is possible, near which instances new labels should be queried.

The usual approach to visualizing decision boundaries in two dimensions, classifying each vertex of a 2D grid (see e.g. the Iris SVM example), breaks down when the dimensionality is higher. Although it would in principle be possible to construct a high-dimensional grid and project it down to 2D, this is intractable in practice, since the number of grid vertices to be classified grows exponentially with the number of dimensions.
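
A quick back-of-the-envelope illustration of this blow-up (plain Python, independent of the project code):

# Vertices in a d-dimensional grid with k points per axis: k**d
for d in (2, 13, 64):  # e.g. a 2D toy problem, UCI wine, sklearn digits
    print(f"{d} dims: {100.0 ** d:.0e} vertices at 100 points per axis")
# 2 dims: 1e+04, 13 dims: 1e+26, 64 dims: 1e+128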

Instead, the present method samples from regions in which the given classifier is maximally uncertain (close to 0.5 prediction probability), which are near the decision surface by definition. Subsequently, it projects the decision surface samples (keypoints) into 2D for visualization, and generates and classifies some test points around the estimated decision boundary for validation (see Usage section below).
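
The search for a single such keypoint can be sketched as a black-box minimization of |p(x) - 0.5|. The project itself relies on NLopt; the sketch below substitutes scipy's Nelder-Mead purely for illustration, and the choice of starting point (the midpoint between two opposite-class samples) is an assumption rather than the project's exact heuristic:

import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits(n_class=2)
model = KNeighborsClassifier(n_neighbors=10).fit(digits.data, digits.target)

def uncertainty(x):
    # distance of the predicted probability from 0.5; zero exactly on the decision surface
    return abs(model.predict_proba(x.reshape(1, -1))[0, 1] - 0.5)

# start from the midpoint between one sample of each class, between which the boundary must lie
x0 = (digits.data[digits.target == 0][0] + digits.data[digits.target == 1][0]) / 2.0
result = minimize(uncertainty, x0, method="Nelder-Mead")
print("keypoint residual |p - 0.5|:", result.fun)  # near 0 if a keypoint was found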

Example use case: visualizing the effect of regularization parameters. Left: weakly regularized logistic regression (C=1e-9). Higher decision boundary complexity (noisy positions of cyan keypoints), more testing error (red markers). Right: regularized logistic regression (C=10). Lower complexity, no testing error. Data: sklearn toy digits (64 dimensions).

[Figure: Digits data, decision boundary comparison. Green and blue markers: data points; round markers: training data; square markers: testing data; red highlights: misclassified data points; cyan pentagons: decision boundary keypoints, at which the classifier predicts 0.5 probability.]
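
This comparison could be reproduced with the API from the Usage section below, roughly as follows (the DBPlot import path is assumed to match this repository's module name):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from decisionboundaryplot import DBPlot  # import path assumed for this repository

digits = load_digits(n_class=2)
for C in (1e-9, 10):  # the two regularization settings compared in the figure
    db = DBPlot(LogisticRegression(C=C), PCA(n_components=2))
    db.fit(digits.data, digits.target)
    db.plot().show()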

Example use case: visualizing overfitting and underfitting. Left: k-nearest neighbor with k=2 neighbors. Overfits: data points misclassified (highlighted in red) due to useless 'squiggles'. Right: over-regularized support vector machine (C=1, gamma=0.07). Misses the data point on the left (marked in red) due to underfitting. Data: UCI wine dataset (13 dimensions).

[Figure: Wine data, decision boundary comparison.]

Usage

The project requires scikit-learn, matplotlib and NLopt to run.

Usage example:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from decisionboundaryplot import DBPlot  # this repository's module

digits = load_digits(n_class=2)  # binary subset of the digits data
model = KNeighborsClassifier(n_neighbors=10)  # classifier able to produce probabilities

db = DBPlot(model, PCA(n_components=2))
db.fit(digits.data, digits.target)
db.plot().show()  # use db.plot(generate_testpoints=False).show() to speed up plotting by skipping testpoint generation
# decision_boundary_points_2d, testpoints_2d, y_testpoints, background = db.generate_plot()  # for custom plotting and analysis
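
The commented-out generate_plot() call returns the raw 2D coordinates for custom plotting. The interpretation of its return values below is inferred from their names, so treat this as a sketch rather than documented behaviour:

import matplotlib.pyplot as plt

decision_boundary_points_2d, testpoints_2d, y_testpoints, background = db.generate_plot()

plt.scatter(testpoints_2d[:, 0], testpoints_2d[:, 1], c=y_testpoints, s=5, alpha=0.3)  # generated test points, colored by predicted class
plt.scatter(decision_boundary_points_2d[:, 0], decision_boundary_points_2d[:, 1], c="cyan", marker="p", s=80)  # decision boundary keypoints
plt.show()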

Functionality & Reliability

Note: this is an early-stage research project and a work in progress (it is by no means efficient or well tested)!

The core idea is to use black-box optimization to find keypoints on the decision hypersurface (points in high-dimensional space for which the prediction probability is very close to 0.5) that lie between the two classes in the 2D plot, and to project them to 2D to estimate the location of the decision boundary.

[Figure: Four steps of plotting a high-dimensional decision boundary.]

The DBPlot object takes a classifier (whose decision surface is to be estimated) and, optionally, a dimensionality reduction method (which defaults to PCA) as inputs. Most classifiers exposing predict_proba and predict methods should work, as long as it is possible for them to reach a prediction probability close to 0.5 (note that this is not the case for e.g. 1-NN or a random forest with only 3 trees; in these cases, the optimization will not converge).
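
A quick sanity check along these lines (a hypothetical helper, not part of the project):

import numpy as np

def can_reach_uncertainty(model, X, tol=0.1):
    # rough check: does any prediction come close to 0.5?
    # a 1-NN classifier, for example, only ever outputs probabilities of 0.0 or 1.0
    probs = model.predict_proba(X)[:, 1]
    return np.abs(probs - 0.5).min() <= tol

# e.g. can_reach_uncertainty(KNeighborsClassifier(n_neighbors=1).fit(X, y), X) returns False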

In terms of dimensionality reduction methods, the current version supports all matrix decomposition variants (including PCA, Kernel PCA, NMF etc.), as well as Isomap embeddings for non-linear dimensionality reduction preserving global topology, and any other method that has an implemented and exposed transform(X) function. This can include supervised dimensionality reduction, such as LDA (with solver='eigen'), which projects to the most discriminative axes.
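
Swapping in one of these projections is a one-line change (a sketch using scikit-learn estimators that expose transform(X)):

from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap

db = DBPlot(model, Isomap(n_components=2))  # non-linear, preserves global topology
db = DBPlot(model, KernelPCA(n_components=2, kernel="rbf"))  # kernelized matrix decomposition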

Choosing a dimensionality reduction method that results in acceptable class separation is crucial for achieving interpretable results.

When inspecting graphs and improving your classifier, you can trust:

  • The actual data points (large green and blue points)
  • Misclassification feedback (red circles around the data points)
  • Generated test data points colored according to your classifier predictions (tiny, faint green and blue points)

Everything else is a rough estimate intended to facilitate intuition rather than precision, and needs to be traded off against runtime (set generate_testpoints=False, or decrease n_decision_boundary_keypoints, to reduce runtime; a sketch combining these parameters follows the list below).

  • The decision boundary keypoints (large cyan squares) are guaranteed to lie very close to the decision boundary (how close depends on the acceptance_threshold parameter). With a very small tolerance, they are fairly reliable, but do NOT provide the full picture (a complete, reliable decision boundary could only be plotted with an infinite number of keypoints). To increase reliability, decrease acceptance_threshold or increase the number of decision boundary keypoints.
  • The background shading reflects rough probability scores around the decision boundary, estimated from the generated test data points (its accuracy depends on the number and coverage of these points). As above, it is NOT a full picture (the generated data points do not provide full coverage, and only cover the space between the two classes, not beyond). To increase reliability, increase n_generated_testpoints_per_keypoint (or tweak the internal SVC that approximates them to render the shading, e.g. by enabling tune_background_model=True for the plot).
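
How these parameters could be combined is sketched below; whether each belongs to the constructor or to plot() is assumed from the wording above, so check the source for the exact signatures:

db = DBPlot(model, PCA(n_components=2),
            n_decision_boundary_keypoints=60,  # more keypoints: more reliable estimate, slower
            acceptance_threshold=0.03)  # tighter tolerance around 0.5 prediction probability
db.fit(X, y)  # X, y: your training data
db.plot(generate_testpoints=False).show()  # fastest: skip test point generation entirely
# db.plot(tune_background_model=True).show()  # slower, but more accurate background shading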

All plotted information is subject to the limitations inherent in forcing high-dimensional data into a low-dimensional plot.

Installation

Clone the repo, navigate to the repo directory, and run the following two commands:

pip install -r requirements
conda install -c conda-forge nlopt 

More Repositories

1. semisup-learn: Semi-supervised learning frameworks for Python, which allow fitting scikit-learn classifiers to partially labeled data (Python, 503 stars)
2. sklearn-expertsys: Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black box models (Python, 486 stars)
3. pySeqSLAM: Python SeqSLAM, a port of Niko Sünderhauf's OpenSeqSLAM for place recognition (Python, 46 stars)
4. sklearn-interpretable-tree: Simplified tree-based classifier and regressor for interpretable machine learning (scikit-learn compatible) (Python, 46 stars)
5. sklearn-random-rotation-ensembles: Scikit-learn compatible implementations of the Random Rotation Ensemble idea of (Blaser & Fryzlewicz, 2016) (Python, 43 stars)
6. ROS-road-line-junction-extraction: 3D road line extraction, bird's-eye view projection, and junction detection for ROS stereo images (Python, 34 stars)
7. linear-SVM-on-top-of-CNN-example: Simple example showing how to use intermediate CNN layer activations as feature vectors for training a linear SVM, to create a custom image classifier (33 stars)
8. python-LS-SLAM: Least squares SLAM backend for pose graph-based loop closure in Python (Python, 32 stars)
9. sklearn-random-bits-forest: Scikit-learn compatible wrapper of the Random Bits Forest program written by (Wang et al., 2016) (Python, 9 stars)
10. Cognitive-Map-Structure-Experiment: Code for running 3D psychological experiments in the browser, intended to investigate the structure of spatial memory (JavaScript, 5 stars)
11. Paleolithic-Cooperation-Simulation: An agent-based social simulation of a Palaeolithic human society, for investigating how cooperative behaviour can emerge and prevail (C#, 3 stars)
12. phd-thesis: Bayesian mechanisms in spatial cognition: Towards real-world capable computational cognitive models of spatial memory (TeX, 2 stars)
13. mobile-browser-heart-rate: Estimating users' current heart rate on a smartphone from within the browser, using accelerometer sensors (JavaScript, 1 star)
14. HTSP: Hierarchical convex hull-based heuristic for solving the Traveling Salesman Problem (Java, 1 star)