SimpleSelfAttention (Created 5/14/2019)
(x * x^T) * (W * x)
Python 3.7, PyTorch 1.0.0, fastai 1.0.52
The purpose of this repository is two-fold:
- demonstrate improvements brought by the use of a self-attention layer in an image classification model.
- introduce a new layer, which I call SimpleSelfAttention, a modified version of the SelfAttention layer described in [4].
Updates
v0.3 (6/21/2019)
- Changed the order of operations in SimpleSelfAttention (in xresnet.py); it should now run much faster (see Self Attention Time Complexity.ipynb)
- added fast.ai's csv logging in train.py
v0.2 (5/31/2019)
- Original standalone notebook is now in folder "v0.1"
- model is now in xresnet.py, training is done via train.py (both adapted from fastai repository)
- Added option for symmetrical self-attention (thanks @mgrankin for the implementation)
- Added support for multiple GPUs (thanks to fastai)
- Added option to run fastai's learning rate finder
- Added option to use xresnet18 to xresnet152 baseline architectures
Note: we recommend starting with a single GPU, as running on multiple GPUs will require additional hyperparameter tuning.
How to run (see 'examples' notebook):
%run train.py --woof 1 --size 256 --bs 64 --mixup 0.2 --sa 1 --epoch 5 --lr 3e-3
- woof: 0 for Imagenette, 1 for Imagewoof (dataset will download automatically)
- size: image size
- bs: batch size
- mixup: 0 for no mixup data augmentation
- sa: 1 if we use SimpleSelfAttention, otherwise 0
- sym: 1 to add symmetry to SimpleSelfAttention (requires sa=1; see the example after this list)
- epoch: number of epochs
- lr: learning rate
- lrfinder: 1 to run the learning rate finder instead of training
- dump: 1 to print the model instead of training
- arch: default is 'xresnet50'
- gpu: GPU to train on (by default, all available GPUs appear to be used)
- log: name of csv file to save training log to (folder path is displayed when running)
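For example (flag values here are just illustrative), a run that also enables the symmetric variant would look like:

%run train.py --woof 1 --size 128 --bs 64 --mixup 0 --sa 1 --sym 1 --epoch 47 --lr 8e-3 --arch 'xresnet18'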
For faster training on multiple GPUs, you can try running: python -m fastai.launch train.py (not tested extensively)
Image classification results (work in progress)
We compare a baseline resnet model to the same model with an extra self-attention layer (SimpleSelfAttention, which I will describe further down).
Same run time ~50 epochs test (xresnet18, 128px, Imagewoof dataset[1])
1) We first run the original xresnet18 model for 50 epochs with a range of learning rates and pick the best one:
Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) |
---|---|---|---|---|---|---|
xresnet18 | Imagewoof | 128 | 50 | 1e-3 | 10 | 0.821 |
xresnet18 | Imagewoof | 128 | 50 | 3e-3 | 30 | 0.845 |
xresnet18 | Imagewoof | 128 | 50 | 5e-3 | 10 | 0.846 |
xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.850 |
xresnet18 | Imagewoof | 128 | 50 | 1e-2 | 20 | 0.846 |
xresnet18 | Imagewoof | 128 | 50 | 12e-3 | 20 | 0.844 |
xresnet18 | Imagewoof | 128 | 50 | 14e-3 | 20 | 0.847 |
Note: we are not using mixup.
2) We pick a number of epochs for our xresnet18 + SimpleSelfAttention model that gives the same or lower runtime than the baseline model, and use the learning rate from step 1.
Results using the original self-attention layer are added as a reference.
Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) | Avg Wall Time (# of obs) |
---|---|---|---|---|---|---|---|---|
xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.8498 | 0.00782 | 9:37 (4) |
xresnet18 + simple sa | Imagewoof | 128 | 47 | 8e-3 | 20 | 0.8567 | 0.00937 | 9:28 (4) |
xresnet18 + original sa | Imagewoof | 128 | 47 | 8e-3 | 20 | 0.8547 | 0.00652 | 11:20 (1) |
This is using a single RTX 2080 Ti GPU. Wall time is measured with the %%time magic in Jupyter notebooks.
Parameters:
%run train.py --woof 1 --size 128 --bs 64 --mixup 0 --sa 0 --epoch 50 --lr 8e-3 --arch 'xresnet18'
%run train.py --woof 1 --size 128 --bs 64 --mixup 0 --sa 1 --epoch 47 --lr 8e-3 --arch 'xresnet18'
We can compare the results using an independent samples t-test (https://www.medcalc.org/calc/comparison_of_means.php):
- Difference: 0.007
- 95% confidence interval: 0.0014 to 0.0124
- Significance level: P = 0.0157
Adding a SimpleSelfAttention layer seems to provide a statistically significant boost in accuracy after training for ~50 epochs, without additional run time, and while using a learning rate optimized for the original model.
SimpleSelfAttention provides results similar to the original SelfAttention layer while decreasing run time.
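If you want to reproduce the significance level from the table rather than from the raw runs, here is a minimal sketch (it assumes scipy is installed; the means, standard deviations, and run counts are the summary statistics from the table above):

```python
from scipy import stats

# Summary statistics from the ~50 epochs table above (20 runs each)
baseline_mean,  baseline_std,  baseline_n  = 0.8498, 0.00782, 20
simple_sa_mean, simple_sa_std, simple_sa_n = 0.8567, 0.00937, 20

# Independent two-sample t-test computed from summary statistics (equal variances assumed)
t, p = stats.ttest_ind_from_stats(simple_sa_mean, simple_sa_std, simple_sa_n,
                                  baseline_mean,  baseline_std,  baseline_n)
print(f"difference = {simple_sa_mean - baseline_mean:.4f}, p = {p:.4f}")   # ~0.007, p ~ 0.016
```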
Same run time ~100 epochs test (xresnet18, 128px, Imagewoof dataset[1])
We use the same parameters as for 50 epochs and double the number of epochs:
Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) | Avg Wall Time(# of obs) |
---|---|---|---|---|---|---|---|---|
xresnet18 | Imagewoof | 128 | 100 | 8e-3 | 23 | 0.8576 | 0.00817 | 20:05 (4) |
xresnet18 + simple sa | Imagewoof | 128 | 94 | 8e-3 | 23 | 0.8634 | 0.00740 | 19:27 (4) |
- Difference: 0.006
- 95% CI 0.0012 to 0.0104
- Significance level P = 0.0153
~100 epochs test with Mixup=0.2 (xresnet18, 128px, Imagewoof dataset[1])
Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) | Avg Wall Time(# of obs) |
---|---|---|---|---|---|---|---|---|
xresnet18 | Imagewoof | 128 | 100 | 8e-3 | 15 | 0.8636 | 0.00585 | ? |
xresnet18 + simple sa | Imagewoof | 128 | 94 | 8e-3 | 15 | 0.87106 | 0.00726 | ? |
xresnet18 + original sa | Imagewoof | 128 | 94 | 8e-3 | 15 | 0.8697 | 0.00726 | ? |
Again here, SimpleSelfAttention performs as well as the original self-attention layer and beats the baseline model.
~50 epochs, 256px images, Mixup = 0.2
Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) | Avg Wall Time(# of obs) |
---|---|---|---|---|---|---|---|---|
xresnet18 | Imagewoof | 256 | 50 | 8e-3 | 15 | 0.9005 | 0.00595 | _ |
xresnet18 + simple sa | Imagewoof | 256 | 47 | 8e-3 | 15 | 0.9002 | 0.00478 | _ |
So far, we have not detected an improvement when using 256px images.
Simple Self Attention layer
The only difference between the baseline and the proposed model is the addition of a self-attention layer at a specific position in the architecture.
The new layer, which I call SimpleSelfAttention, is a modified and simplified version of the fastai implementation ([3]) of the self-attention layer described in the SAGAN paper ([4]).
Original layer:
```python
# conv1d and tensor are fastai v1 helpers (fastai.layers / fastai.torch_core)
class SelfAttention(nn.Module):
    "Self attention layer for nd."
    def __init__(self, n_channels:int):
        super().__init__()
        self.query = conv1d(n_channels, n_channels//8)
        self.key   = conv1d(n_channels, n_channels//8)
        self.value = conv1d(n_channels, n_channels)
        self.gamma = nn.Parameter(tensor([0.]))

    def forward(self, x):
        # Notation from https://arxiv.org/pdf/1805.08318.pdf
        size = x.size()
        x = x.view(*size[:2],-1)
        f,g,h = self.query(x), self.key(x), self.value(x)
        beta = F.softmax(torch.bmm(f.permute(0,2,1).contiguous(), g), dim=1)
        o = self.gamma * torch.bmm(h, beta) + x
        return o.view(*size).contiguous()
```
Proposed layer:
Edit (6/21/2019): the order of operations matters for complexity! Changed from x * (x^T * conv(x)) to (x * x^T) * conv(x) (see the check after the code below).
```python
class SimpleSelfAttention(nn.Module):
    def __init__(self, n_in:int, ks=1, sym=False):
        super().__init__()
        self.conv = conv1d(n_in, n_in, ks, padding=ks//2, bias=False)
        self.gamma = nn.Parameter(tensor([0.]))
        self.sym = sym     # symmetric-W option (handling omitted here; see xresnet.py)
        self.n_in = n_in

    def forward(self, x):
        size = x.size()
        x = x.view(*size[:2],-1)                             # (C,N)
        convx = self.conv(x)                                 # (C,C) * (C,N) = (C,N)   => O(NC^2)
        xxT = torch.bmm(x, x.permute(0,2,1).contiguous())    # (C,N) * (N,C) = (C,C)   => O(NC^2)
        o = torch.bmm(xxT, convx)                            # (C,C) * (C,N) = (C,N)   => O(NC^2)
        o = self.gamma * o + x
        return o.view(*size).contiguous()
```
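As a quick sanity check of the 6/21/2019 edit, both multiplication orders give the same output; only the cost changes. This is an illustrative snippet on plain tensors (the sizes C and N and the matrix W are hypothetical stand-ins for a feature map and the 1x1 convolution weights):

```python
import torch

C, N = 64, 16 * 16                               # channels, spatial positions (H*W); illustrative sizes
x = torch.randn(1, C, N, dtype=torch.float64)    # flattened feature map, shape (B, C, N)
W = torch.randn(1, C, C, dtype=torch.float64)    # stands in for the 1x1 convolution weights

# x * (x^T * (W * x)): builds an (N,N) intermediate, O(C*N^2)
slow = torch.bmm(x, torch.bmm(x.permute(0, 2, 1), torch.bmm(W, x)))
# (x * x^T) * (W * x): every intermediate is (C,C) or (C,N), O(N*C^2)
fast = torch.bmm(torch.bmm(x, x.permute(0, 2, 1)), torch.bmm(W, x))

print(torch.allclose(slow, fast))   # True: same result, different cost
```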
An important tip for convergence:
Convergence can be an issue when adding a SimpleSelfAttention layer to an existing architecture. We've observed that, when placed within a ResNet block, the network converges if SimpleSelfAttention is placed right after a convolution layer whose batch norm weights are initialized to 0. In our code (xresnet.py), this is done by setting zero_bn=True for the conv_layer that precedes SimpleSelfAttention.
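To make the tip concrete, here is a schematic plain-PyTorch sketch (not the exact code from xresnet.py): the convolution feeding SimpleSelfAttention ends with a BatchNorm whose weight is zero-initialized, so the attention branch starts out close to the identity:

```python
import torch.nn as nn

def conv_bn(ni, nf, ks=3, stride=1, zero_bn=False):
    "Conv + BatchNorm; zero_bn=True zero-initializes the BN weight (schematic helper)."
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    return nn.Sequential(nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False), bn)

# Tail of a residual block: the conv right before SimpleSelfAttention uses zero_bn=True
block_tail = nn.Sequential(
    conv_bn(64, 64, zero_bn=True),
    SimpleSelfAttention(64),   # the layer defined above
)
```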
Some more details
As described in the SAGAN paper ([4]), the original layer takes the image features x of shape (C,N) (where N = H * W) and transforms them into f(x) = Wf * x and g(x) = Wg * x, where Wf and Wg have shape (C',C) and C' is chosen to be C/8. These matrix multiplications can be implemented as 1x1 convolution layers. We then compute S = f(x)^T * g(x).
Therefore, S = (Wf * x)^T * (Wg * x) = x^T * (Wf^T * Wg) * x. My first proposed simplification is to combine Wf^T * Wg into a single (C,C) matrix W, so that S = x^T * W * x. S = S(x,x) (a bilinear form) has shape (N,N) and represents the influence of each pixel on the other pixels ("the extent to which the model attends to the ith location when synthesizing the jth region" [4]). Note that S(x,x) depends on the input, whereas W does not. (I suspect that having the same bilinear form for every input might be the reason we do better on Imagewoof, 10 dog breeds, than on Imagenette, 10 very different classes.)
Thus, we only learn weights W for one convolution layer instead of weights Wf and Wg for two convolution layers. The advantages are simplicity, the removal of one design choice (C' = C/8), and a matrix W that offers more possibilities than Wf^T * Wg. One possible drawback is that we have more parameters to learn (C^2 vs. C^2/4). One option we haven't tried here is to force W to be a symmetric matrix; this would reduce the number of parameters and force the influence of "pixel" j on pixel i to be the same as that of pixel i on pixel j.
Edit: @mgrankin tested symmetry and got a small improvement [5]
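To make the first simplification concrete, here is a small numerical check (hypothetical sizes; Wf and Wg stand in for the weights of the 1x1 query/key convolutions) showing that folding Wf^T * Wg into a single matrix W leaves S unchanged:

```python
import torch

C, N = 16, 49                                        # channels and spatial positions, illustrative
x  = torch.randn(C, N, dtype=torch.float64)
Wf = torch.randn(C // 8, C, dtype=torch.float64)     # "query" weights, shape (C',C) with C' = C/8
Wg = torch.randn(C // 8, C, dtype=torch.float64)     # "key" weights

S_two_convs  = (Wf @ x).t() @ (Wg @ x)   # S = f(x)^T * g(x), shape (N,N)
W            = Wf.t() @ Wg               # single combined (C,C) matrix
S_one_matrix = x.t() @ W @ x             # S = x^T * W * x

print(torch.allclose(S_two_convs, S_one_matrix))   # True
```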
The next step in the original version of the layer is to compute the softmax of matrix S. I decided to remove this step completely and work with unrestricted weights instead of normalized probability-like weights.
The final step in the original version is to compute h(x) = Wh * x (Wh of shape (C,C)), which is also implemented as a 1x1 convolution layer. The final output is then o = gamma * h(x) * S + x. We propose to remove this final convolution layer and have the output be o = gamma * x * S + x. This final convolution could be re-added as a separate layer if desired, although this implies a different position for the skip connection.
References
[1] https://github.com/fastai/imagenette
[2] https://github.com/fastai/fastai/blob/master/examples/train_imagenette.py
[3] https://github.com/fastai/fastai/blob/5c51f9eabf76853a89a9bc5741804d2ed4407e49/fastai/layers.py
[4] Zhang et al., Self-Attention Generative Adversarial Networks: https://arxiv.org/abs/1805.08318