
Reliability diagrams visualize whether a classifier model needs calibration

Reliability diagrams

A classifier with a sigmoid or softmax layer outputs a number between 0 and 1 for each class, which we tend to interpret as the probability that this class was detected. However, this is only the case if the classifier is calibrated properly!

The paper On Calibration of Modern Neural Networks by Guo et al. (2017) claims that modern, deep neural networks are often not calibrated. As a result, interpreting the predicted numbers as probabilities is not correct.

When a model is calibrated, the confidence score should equal the accuracy. For example, if your test set has 100 examples for which the model predicts 0.8, the accuracy over those 100 examples should be 80%. In other words, if 0.8 is a true probability, the model should get 20% of these examples wrong! For all the examples with confidence score 0.9, the accuracy should be 90%, and so on.

One way to find out how well your model is calibrated is to draw a reliability diagram. Here is the reliability diagram for the model gluon_senet154 from pytorch-image-models, trained on ImageNet:

These results were computed over the ImageNet validation set of 50,000 images.

The top part of this image is the reliability diagram. The bottom part is a confidence histogram.

How to interpret these plots?

First, the model's predictions are divided into bins based on the confidence score of the winning class. For each bin we calculate the average confidence and average accuracy.
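
As a rough sketch of that binning step (an illustration with made-up array names, not the repo's actual reliability_diagrams.py code):

```python
import numpy as np

def per_bin_stats(confidences, correct, num_bins=20):
    """Split predictions into equal-width confidence bins and return, per bin,
    the number of examples, the average confidence, and the average accuracy.

    confidences: winning-class scores in [0, 1], shape (num_examples,)
    correct:     boolean array, True where pred_label == true_label
    """
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Assign each example to a bin. (The exact open/closed bin-edge
    # convention may differ slightly from the repo's implementation.)
    bin_ids = np.digitize(confidences, bin_edges[1:-1])

    counts = np.zeros(num_bins, dtype=int)
    avg_conf = np.zeros(num_bins)
    avg_acc = np.zeros(num_bins)
    for b in range(num_bins):
        mask = bin_ids == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            avg_conf[b] = confidences[mask].mean()
            avg_acc[b] = correct[mask].mean()
    return counts, avg_conf, avg_acc
```

The confidence histogram plots `counts`, and the reliability diagram plots `avg_conf` and `avg_acc` for each bin.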

The confidence histogram at the bottom shows how many test examples are in each bin. Here, we used 20 bins. It is clear from the histogram that most predictions of this model had a confidence of > 0.8.

The two vertical lines indicate the overall accuracy and average confidence. The closer these two lines are together, the better the model is calibrated.

The reliability diagram at the top shows the average confidence for each bin, as well as the accuracy of the examples in each bin.

Usually the average confidence for a bin lies on or close to the diagonal. For example, if there are 20 bins, each bin is 0.05 wide, and the average confidence for the bin (0.9, 0.95] will typically be around 0.925. (The first bin is an exception: with softmax, the probability of the winning prediction is always at least 1/num_classes, which pushes the average confidence for that bin up a bit.)

For each bin we plot the difference between the accuracy and the confidence. Ideally, the accuracy and confidence are equal. In that case, the model is calibrated and we can interpret the confidence score as a probability.

If the model is not calibrated, however, there is a gap between accuracy and confidence. These are the red bars in the diagram. The larger the bar, the greater the gap.

The diagonal is the ideal accuracy for each confidence level. If the red bar goes below the diagonal, it means the confidence is larger than the accuracy and the model is too confident in its predictions. If the red bar goes above the diagonal, the accuracy is larger than the confidence, and the model is not confident enough.

The black lines in the plot indicate the average accuracy for the examples in each bin:

  • If the black line is at the bottom of a red bar, the model is over-confident for the examples in that bin.

  • If the black line is on top of a red bar, the model is not confident enough in its predictions.

By calibrating the model, we can bring these two things more in line with one another. Note that, when calibrating, the model's accuracy doesn't change (although this may depend on the calibration method used). It just fixes the confidence scores so that a prediction of 0.8 really means the model is correct 80% of the time.

For the gluon_senet154 plot above, notice how most of the gaps extend above the diagonal. This means the model is more accurate than it thinks. Only for the bin (0.95, 1.0] is it overestimating its accuracy. For the bins around 0.5, the calibration is just right.

Because not every bin has the same number of examples, some bins affect the calibration of the model more than others. You can see this distribution in the histogram. To make the importance of the bins even clearer, the red bars are darker for bins with more examples and lighter for bins with fewer examples. It's immediately clear that most predictions for this model have a confidence between 0.85 and 0.9, but that the accuracy for this bin is actually more like ~0.95.

The top diagram also includes the ECE, or Expected Calibration Error. This is a summary statistic that gives the difference in expectation between confidence and accuracy. In other words, it's a weighted average of the gaps across all bins, weighted by the number of examples in each bin. Lower is better.
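
Written out, with B_m the set of examples whose confidence falls in bin m and N the total number of examples:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\,\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\,\bigr|$$

With the per-bin arrays from the earlier sketch, this is simply `np.sum(counts / counts.sum() * np.abs(avg_acc - avg_conf))`.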

Is it bad to be more accurate than confident?

The models in the Guo paper are too confident. On examples for which those models predict 90% confidence, the accuracy is something like 80%. That obviously sounds like it's a problem.

But in my own tests so far, I've found that the accuracy is actually larger than the confidence in most bins, meaning the model's confidence scores are too low. You can see this in the plot above: the black lines (indicating the accuracy) almost all lie above the diagonal.

Is that a bad thing? A model being more accurate than it is confident doesn't sound so bad...

So, which is worse: a model that is too confident, or a model that is not confident enough?

  • Over-confidence (conf > acc): this gives more false positives. The model asserts things that are not actually true. It's the same as setting your decision threshold lower, so more predictions come through.

  • Under-confidence (acc > conf): this gives more false negatives. It's as if you're using a higher decision threshold, so you miss things you should have caught.

Which is worse depends on the use case, I guess. But if you want to be able to properly interpret the predictions as probabilities -- for example, because you want to feed the output from the neural network into another probabilistic model -- then you don't want the gaps in the reliability diagram to be too large. Remember that calibrating doesn't change the accuracy, it just shifts the confidences around so that 0.9 really means you get 90% correct.

OK, so my model is not calibrated, how do I fix it?

See the Guo paper for techniques. There have also been follow-up papers with new techniques.
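
The simplest technique from the Guo paper is temperature scaling: divide the logits by a single scalar T, learned on a held-out validation set by minimizing the NLL, before taking the softmax. Here is a minimal PyTorch-style sketch; the function name and the LBFGS settings are my own choices, not prescribed by the paper:

```python
import torch

def fit_temperature(val_logits, val_labels):
    """Learn a scalar temperature T > 0 that minimizes the NLL of
    softmax(val_logits / T). Dividing all logits by the same T changes
    the confidence scores but never the argmax, so accuracy is unchanged.

    val_logits: precomputed validation-set logits, shape (N, num_classes)
    val_labels: ground-truth class indices, shape (N,)
    """
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / temperature, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# At test time, use softmax(test_logits / T) as the calibrated confidences.
```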

The goal of this repo is only to help visualize whether a model needs calibration or not. Think of it as being a part of a diagnostic toolkit for machine learning models.

The code

Python code to generate these diagrams is in reliability_diagrams.py.

The notebook Plots.ipynb shows how to use it.

The folder results contains CSV files with the predictions of various models. Each CSV file has three columns:

true_label, pred_label, confidence

For a multi-class model, the predicted label and the confidence are for the highest-scoring class.

To generate a reliability diagram for your own model, run it on your test set and output a CSV file in this format.
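
For example, here is a minimal sketch of writing such a CSV from per-example softmax outputs. The array names are placeholders, and details like the output filename and whether a header row is expected should be checked against Plots.ipynb:

```python
import csv

# probs:       (num_examples, num_classes) softmax outputs (NumPy array)
# true_labels: (num_examples,) ground-truth class indices
pred_labels = probs.argmax(axis=1)   # highest-scoring class per example
confidences = probs.max(axis=1)      # its softmax score

with open("results/my_model_my_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["true_label", "pred_label", "confidence"])
    for t, p, c in zip(true_labels, pred_labels, confidences):
        writer.writerow([int(t), int(p), float(c)])
```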

Currently included are results for models from pytorch-image-models and torchvision.

Interestingly, the models from pytorch-image-models tend to underestimate the confidence, while similar models from torchvision are over-confident (both are trained on ImageNet).

The figures folder contains some PNG images of these reliability diagrams.

License: MIT
