
Implementing Bayes by Backprop

Introduction

The code in this repository implements Bayesian inference on a deep neural network. The repository also serves as notes for my talk at PyData Amsterdam 2018, "Bayesian Deep Learning with 10% of the weights".

Getting started

Move your console to the outer weight_uncertainty directory:

> cd weight_uncertainty   (you are now in the outer weight_uncertainty directory)
> pip install -e .
> python weight_uncertainty/main.py

These commands install the repo and run the training process.

Motivation

Conventional neural networks suffer from two problems, which motivate this repository:

  • Conventional neural networks give no uncertainty on their predictions.

    • This is detrimental for critical applications. For example, if a neural network diagnoses you with a disease, wouldn't you want to know how certain it is of that diagnosis?
    • This also makes neural networks susceptible to adversarial attacks. In adversarial attacks, imperceptible changes to the input result in vastly different predictions. We want a neural network to report high uncertainty when we feed it an adversarial input.
  • Conventional neural networks have millions of parameters.

    • This is detrimental for mobile applications. In mobile applications, we often have little memory and limited compute power. If we can prune the parameters, the model takes up less memory and needs less compute to make a prediction.
    • (There are some speculations that the redundant parameters make it easier for adversarial attacks, but that is just a hypothesis.)

This repository proposes a solution to both problems.

Short summary of solution

In short: in conventional learning of neural nets, we use SGD to find one parameter vector. In this project, we are going to find multiple parameter vectors. When making a prediction, we average the outputs of the neural net with each parameter vector. You can think of this as an ensemble method.

I hear you asking: how do we get multiple parameter vectors? Answer: we sample them from the posterior over our parameters.

We infer a posterior over our parameters according to Bayes' rule: $p(w|data) \propto p(data|w)p(w)$. This posterior helps us in two ways:

  • The predictions using the parameter posterior naturally give us uncertainty in our predictions: $p(y|x) = \int_w p(y|x,w)p(w|data)dw$
  • The posterior tells us which parameters have a high probability of being zero. We will prune these parameters.

Parameter posterior

Let us first write down the posterior. For the posterior, we need a likelihood and a prior. In this repository we deal with classification, so our likelihood is the probability of the prediction for the correct class. We choose a Gaussian prior over our parameters. The prior might sound like a new concept to many people, but I want to convince you that we have been using priors all the time. When we do L2 regularisation or when we do weight decay, that corresponds to assuming a Gaussian prior on the parameters.

$p(w|data) \propto p(data|w)p(w)$

$\log p(w|data) = \log p(data|w) + \log p(w) + \text{constant}$

$-\log p(w|data) = \text{classification loss} + \lambda \sum_i w_i^2 + \text{constant}$

So actually, we have been using the parameter posterior all the time when we did L2 regularisation. However, in conventional learning, we used only one parameter vector from this posterior. In this repository, we want to sample multiple parameter vectors from the posterior.
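
To make the correspondence explicit: a zero-mean Gaussian prior with variance $\sigma_p^2$ (my notation, not a symbol from the code) gives

$\log p(w) = \sum_i \log \mathcal{N}(w_i; 0, \sigma_p^2) = -\frac{1}{2\sigma_p^2}\sum_i w_i^2 + \text{constant}$

so the L2 coefficient corresponds to $\lambda = \frac{1}{2\sigma_p^2}$, up to the sign convention of minimizing a loss versus maximizing a log probability.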

How do we sample from the posterior?

Exact sampling from the posterior is hard. Therefore, we make a local approximation to the posterior that we can easily sample. We want a richer approximation than a point approximation. But we also do not want to overcomplicate matters. Therefore, we approximate the posterior with a Gaussian. The Gaussian is ideal, because:

  • The Gaussian distribution can capture the local structure of the true posterior. This will tell us about the behavior of parameter vectors: which parameters can assume a wide range of values, and which parameters are fairly restricted.
  • The Gaussian distribution has a simple form that we can use for pruning. Each parameter will have a mean and a standard deviation. With the mean and standard deviation, we calculate the probability at zero in one simple line, so pruning will be efficient. (A minimal sketch of such an approximation follows this list.)
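
To make "sampling from the approximation" concrete, here is a minimal numpy sketch of a mean-field Gaussian approximation. The class name and the softplus parameterization of sigma are my own illustration, not necessarily how the repository stores its parameters:

import numpy as np

class GaussianApproximation:
    # Mean-field Gaussian over the parameter vector: one mu and one sigma per weight.
    def __init__(self, num_params):
        self.mu = np.zeros(num_params)        # means of the approximate posterior
        self.rho = np.full(num_params, -3.0)  # unconstrained; sigma = softplus(rho) > 0

    @property
    def sigma(self):
        return np.log1p(np.exp(self.rho))     # softplus keeps sigma positive

    def sample(self):
        # Reparameterization: w = mu + sigma * eps with eps ~ N(0, I)
        eps = np.random.randn(*self.mu.shape)
        return self.mu + self.sigma * eps

Every call to sample() gives another plausible parameter vector, and the per-weight sigma is exactly the quantity we will later use for pruning.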

Loss function

We will find our approximation via stochastic gradient descent. This time, however, the loss function for SGD differs a little bit.

Remember that the old loss function was:

$loss = -\log p(w|data) = \text{classification loss} + \lambda \sum_i w_i^2$

Then our new loss function becomes:

$loss = \text{classification loss} + \sum_i \left( -\log\sigma_i + \frac{1}{2}\lambda \sigma_i^2 + \frac{1}{2}\lambda\mu_i^2 \right)$

What changed in the loss function?

  • Both loss functions have the classification loss
  • Both loss functions have a squared penalty on the mean of the parameter vector
  • The new loss function has an additional penalty on $\sigma$. This penalty penalizes small sigmas; in other words, it promotes large values of sigma. In the im directory, you find a figure of this penalty term, named loss_sigma.png. (A code sketch of these penalty terms follows this list.)
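
As a sketch of the two penalty terms, not the repository's exact code, the regularizer can be written per parameter as below; classification_loss and the prior strength lam are placeholders:

import numpy as np

def regularizer(mu, sigma, lam):
    # -log(sigma) penalizes small sigmas; the quadratic terms keep mu and sigma from growing.
    return np.sum(-np.log(sigma) + 0.5 * lam * sigma**2 + 0.5 * lam * mu**2)

def total_loss(classification_loss, mu, sigma, lam=1e-4):
    return classification_loss + regularizer(mu, sigma, lam)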

Let's see some code

At PyData, we love Python. So let's write this out in Python.

We would train conventional neural networks like so:

while not converged:
  # Get the loss
  x, y = sample_batch()
  loss = loss_function(x, y, w)

  # Update the parameters
  w_grad = gradient(loss, w)
  w = update(w, w_grad)

In Bayesian inference, we make an approximation to the posterior. So we would approximate the posterior like so:

while not converged:
  # Get the loss
  x, y = sample_batch()
  w = approximation.sample()
  loss = loss_function(x, y, w)

  # Update the approximation
  w_grad = gradient(loss, w)
  approximation = update(approximation, w_grad)
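
One step this pseudocode glosses over is how a gradient with respect to the sampled w becomes an update for the approximation. With the reparameterization $w = \mu + \sigma \epsilon$, the chain rule gives $\partial loss/\partial \mu = \partial loss/\partial w$ and $\partial loss/\partial \sigma = \partial loss/\partial w \cdot \epsilon$. The sketch below continues the hypothetical GaussianApproximation from above, assumes we kept the eps that produced the sample, and only shows the gradient path through the sampled w; the penalty terms on mu and sigma contribute additional gradients, and the learning rate lr is a placeholder:

import numpy as np

def update(approximation, w_grad, eps, lr=1e-3):
    # Chain rule through w = mu + sigma * eps with sigma = softplus(rho):
    #   d loss / d mu  = d loss / d w
    #   d loss / d rho = d loss / d w * eps * sigmoid(rho)   (the derivative of softplus is the sigmoid)
    sigmoid = 1.0 / (1.0 + np.exp(-approximation.rho))
    approximation.mu = approximation.mu - lr * w_grad
    approximation.rho = approximation.rho - lr * w_grad * eps * sigmoid
    return approximation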

I made a separate document in /docs/ to explain in a formal sense why this new loss function works for approximating the parameter posterior. Read it at your own risk :) You can read, use, and enjoy this entire repository without ever reading it.

Making predictions with uncertainty

Now that we have sampled parameter vectors, let's use them to make predictions and get uncertainties. What we want to know is the probability for an output class, given the input. We will make this prediction by averaging the output of the neural net with each of the parameter vectors:

Again, we love Python, so let's write some Python:

import numpy as np

def sample_prediction(input):
    # num_samples, approximation, and model come from the training context
    for _ in range(num_samples):
        w = approximation.sample()
        yield model.predict(input, w)

prediction = np.mean(list(sample_prediction(input)), axis=0)

(RestoredModel.predict() in util.util.py implements exactly this)

What does this code do?

  • We repeatedly sample a parameter vector from our approximation and use each sampled parameter vector to make one prediction.
  • Our final prediction is the average of all the sampled predictions.

In this project, we work with classification. Therefore, $p(y|x)$ is a vector of dimension num_classes. Each entry in the vector gives the probability that the input belongs to that class.

For example, if our classification problem concerns cats, dogs, and cows, then prediction[1] gives the probability that the input is a dog.

Intuition for the averaging

Why does it help to sample many parameter vectors and average them?

Three types of intuition:

  • Intuition: this averaging looks like an ensemble method, and many models know more than one model.
  • Robust: think about adversarial examples. An image might be an adversarial input for one model, but it is hard for it to be adversarial for all the models, so the averaging washes out the adversarial prediction.
  • Formal: this sampling and averaging approximates the posterior predictive distribution: $p(y|x) = \int_w p(y|x,w)p(w|data)dw$

(When I say different models, I mean to say: our model with different parameter vectors.)

Getting the uncertainty

How do we get one number that tells us the uncertainty of our prediction? We have a full posterior predictive distribution, $p(y|x)$. We want one number that quantifies the uncertainty.

There are many choices for this one number to summarize the uncertainty:

  • Use the predicted probability prediction[i]
  • Use the variance in the predicted probabilities np.var(list(sample_prediction(input)), axis=0)[i]
  • Use the variation ratio: one minus the fraction of sampled predictions that vote for the most common class
  • Use the predictive entropy entropy(prediction)
  • Use the mutual information between parameters and labels entropy(prediction) - np.mean([entropy(p) for p in sample_prediction(input)])

If you are interested in comparing these uncertainty quantifiers, this paper compares them.

What we really care about is which uncertainty quantifier makes us robust against adversarial attacks. Fortunately, the authors of this paper compare the uncertainty quantifiers under adversarial attacks. They conclude that the variation ratio, the predictive entropy, and the mutual information all increase for adversarial inputs. I care about simplicity, so I will use the predictive entropy in the rest of the project.
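
Since the rest of the project uses the predictive entropy, here is a minimal sketch of that quantifier; the helper name and the shape of sampled_probs (the stacked outputs of sample_prediction, one row per sample) are my own assumptions:

import numpy as np

def predictive_entropy(sampled_probs):
    # sampled_probs: array of shape (num_samples, num_classes) with softmax outputs
    mean_probs = sampled_probs.mean(axis=0)                  # posterior predictive p(y|x)
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12))  # entropy in nats

A prediction that is spread over many classes gets high entropy, which is how "uncertain" shows up as a single number.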

How to prune the parameters?

Now let's answer how to prune the parameters. We have a neural network with millions of weights. We want to drop many of them, or at least zero them out. The question we face is the following: which parameters should we drop first?

Intuitively, we drop the parameters first that are least useful. For example, if a parameter has a high posterior probability of being zero, we might as well drop it. Conversely, if a parameter has a low posterior probability of being zero, we want to keep it. We follow this intuition as we prune parameters: 1) we pick a threshold for the zero probability and 2) we sweep over all the parameters and drop the ones whose probability at zero is above the threshold.

Again, PyData loves Python, so let's write some Python:

from scipy.stats import norm

for param, mu, sigma in approximation():
    # Density of the parameter's Gaussian posterior at zero
    zero_probability = norm.pdf(0.0, loc=mu, scale=sigma)
    if zero_probability > threshold:
        model.drop(param)

For the corresponding code in the project, see: RestoredModel.pruning(threshold)
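
To connect this pruning loop to the pruning curves in the experiments below, one could compute, for a given threshold, which fraction of the parameters would be dropped. This is only a sketch; the helper name is mine and not part of the repository:

import numpy as np
from scipy.stats import norm

def fraction_prunable(approximation, threshold):
    # Fraction of parameters whose Gaussian posterior density at zero exceeds the threshold
    zero_probs = np.array([norm.pdf(0.0, loc=mu, scale=sigma)
                           for _, mu, sigma in approximation()])
    return np.mean(zero_probs > threshold)

Sweeping the threshold and re-evaluating the pruned model at each setting gives the pruning curves shown below.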

Experiments and results

For the experiments, we run the Bayesian neural network on three data sets:

  • First, we want an easy data set that everyone understands. Therefore, we pick MNIST.
  • Second, we want an application that many people care about: image classification. Therefore, we pick CIFAR10. It is also closer to a real application than MNIST.
  • Third, we want a time series data set, as it is a common application of neural networks. We also want to show that Bayesian neural networks do not overfit. Therefore, we pick the ECG5000 data set from UCR archive. The train set contains only 500 time series, so we know that a conventional neural network would overfit.

For each data set, we care about the following experiments:

  • What does the pruning curve look like? Do we retain performance as we drop the parameters?
  • What do examples of certain and uncertain inputs look like? Does uncertainty increase for noisy inputs?

To this end, we have three plots per data set:

  • A pruning curve: the horizontal axis shows the fraction of weights dropped; the vertical axis shows the validation performance. We expect the validation performance to remain good until roughly 90% of the parameters are dropped. (That is also the title of the PyData talk.)
  • Examples of inputs: we randomly sample some images from the validation set and we mutilate them by either adding noise or rotating them. As mutilation increases, we expect the uncertainty to increase too.
  • Uncertainty curves: we dive further into our uncertainty numbers and our expectation that they increase with more mutilation. For each mutilation, we plot the uncertainty number as a function of the mutilation value (like the magnitude of the noise or the angle of rotation). This plot confirms at an aggregate level that uncertainty increases with more mutilation.

MNIST

Pruning curve: see the pruning_curve_mnist figure in the repository.

Examples and the uncertainty curves are in the presentation

CIFAR10

Pruning curve: see the pruning_curve_cifar figure in the repository.

Examples and the uncertainty curves are in the presentation

ECG5000

Pruning curve: see the pruning_curve_ucr figure in the repository.

Examples and the uncertainty curves are in the presentation

Summary

Our motivation for this project concerns two problems with neural networks: uncertainty and pruning. Conventional neural networks use one parameter vector. We use the posterior and sample many parameter vectors. For a prediction, we average the output of the neural net over these parameter vectors. We find the uncertainty as the entropy of the posterior predictive distribution. We prune parameters whose probability of being zero exceeds a threshold. Our experiments show that we can prune 90% of the parameters while maintaining performance. We also show pictures to build intuition for our uncertainty numbers.

Our experiments are small. This paper does more extensive speed comparisons. This paper shows how the uncertainty increases under stronger adversarial attacks.

I hope that this code is useful to you. Contact me at [email protected] if I can help more. (Please understand that I get many emails: formulate a concise question.)

Further reading
