Sharpened Cosine Similarity
An alternative to convolution for neural networks
Implementations
Description
Sharpened cosine similarity is a strided operation, like convolution, that extracts features from an image.
It is related to convolution, but with important differences. Convolution is a strided dot product between a signal, s, and a kernel, k.
A cousin of convolution is cosine similarity, where the signal patch and kernel are both normalized to have a magnitude of 1 before the dot product is taken. It is so named because in two dimensions, it gives the cosine of the angle between the signal and the kernel vectors.
The cosine is known for being broad, that is, two quite different vectors can have a moderately high cosine similarity. It can be sharpened by raising the magnitude of the result to a power, p, while maintaining the sign.
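The three operations described so far can be written out for a single signal patch s and kernel k. This is a sketch in the notation of the text; the function names conv, cos, and scs are labels chosen here for illustration.

```latex
\begin{align}
\mathrm{conv}(s, k) &= s \cdot k \\
\mathrm{cos}(s, k)  &= \frac{s \cdot k}{\lVert s \rVert \, \lVert k \rVert} \\
\mathrm{scs}(s, k)  &= \mathrm{sign}(s \cdot k)
    \left( \frac{\lvert s \cdot k \rvert}{\lVert s \rVert \, \lVert k \rVert} \right)^{p}
\end{align}
```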
This measure can become numerically unstable if the magnitude of the signal or kernel ever gets too close to zero. Adding a small value, q, to the signal magnitude prevents this. In practice, the kernel magnitude doesn't get too small and doesn't need this term.
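With the stabilizing term, the measure becomes sign(s·k) (|s·k| / ((‖s‖ + q) ‖k‖))^p. The following is a minimal pure-Python sketch of that formula for one patch, plus a strided 1-D version. The function names, the 1-D setting, and the default values of p and q are assumptions made for illustration; in the implementations listed in this README, p and q are typically learned parameters and the operation runs over 2-D images.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean magnitude of a sequence."""
    return math.sqrt(dot(a, a))


def sharpened_cosine_similarity(s, k, p=2.0, q=1e-3):
    """SCS between one signal patch s and one kernel k (flat sequences).

    p and q are fixed here for illustration only.
    """
    d = dot(s, k)
    # q is added to the signal magnitude only, as described in the text.
    cos = d / ((norm(s) + q) * norm(k))
    # Sharpen the magnitude with exponent p, then restore the sign.
    return math.copysign(abs(cos) ** p, d)


def scs_1d(signal, kernel, stride=1, p=2.0, q=1e-3):
    """Slide the kernel along a 1-D signal, computing SCS at each position."""
    n = len(kernel)
    return [
        sharpened_cosine_similarity(signal[i:i + n], kernel, p, q)
        for i in range(0, len(signal) - n + 1, stride)
    ]
```

For two vectors 45 degrees apart, such as `[1, 1]` and `[1, 0]`, setting `q=0` and `p=1` gives plain cosine similarity, about 0.707, while `p=4` sharpens it to about 0.25. An all-zero patch returns 0 rather than dividing by zero, as long as q is positive.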
Background
The idea behind sharpened cosine similarity first surfaced as a Twitter thread in 2020. There's some more development in this blog post.
Tips and Tricks
These are some things that have been reported to work so far.
- The big benefit of SCS appears to be parameter efficiency and architecture simplicity. It doesn't look like it's going to beat any accuracy records, and it doesn't always run very fast, but it's killing it on this parameter efficiency leaderboard.
- Skip the nonlinear activation layers, like ReLU and sigmoid, after SCS layers.
- Skip the dropout layers after SCS layers.
- Skip the normalization layers, like batch normalization or layer normalization, after SCS layers.
- Use MaxAbsPool instead of MaxPool. It selects the element with the highest magnitude of activity, even if it's negative.
- Raising activities to the power p generally doesn't parallelize well on GPUs and TPUs. It will slow your code down a LOT compared to straight convolutions. Disabling the p parameters results in a huge speedup on GPUs, but this takes the "sharpened" out of SCS. Regular old cosine similarity is cool, but it is its own thing with its own limitations.
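The MaxAbsPool tip above can be sketched in a few lines. This is a pure-Python illustration over a 2-D list of numbers with non-overlapping square windows; the function name and the `size` parameter are hypothetical and do not come from any of the listed implementations.

```python
def max_abs_pool_2d(x, size=2):
    """Pool non-overlapping size-by-size windows of a 2-D list.

    Each window is reduced to the entry with the largest magnitude,
    and that entry keeps its original sign.
    """
    rows, cols = len(x), len(x[0])
    return [
        [
            max(
                (x[r + i][c + j] for i in range(size) for j in range(size)),
                key=abs,  # compare by magnitude, return the signed value
            )
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]
```

A window containing 1, -5, 3, and 4 pools to -5 here, where ordinary max pooling would return 4.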
Examples
In the age of gargantuan language models, it's uncommon to talk about how few parameters a model uses, but it matters when you hope to deploy on compute- or power-limited devices. Sharpened cosine similarity is exceptionally parameter efficient.
The repository scs_torch_gallery has a handful of working examples. cifar10_80_25214.py is an image classification model that gets 80% accuracy on CIFAR-10, using only 25.2k parameters. According to the CIFAR-10 page on Papers With Code, this is somewhere around one-tenth of the parameters in previous models in this accuracy range.
Reverse Chronology
Date | Milestone |
---|---|
2022-12-06 | Paper by Skyler Wu, Fred Lu, Edward Raff, James Holt in NeurIPS 2022 ICBINB Workshop |
2022-04-23 | Code by Steven Walton. SCS in Compact Transformers. |
2022-03-28 | Code by Raphael Pisoni. Jax implementation. |
2022-03-11 | Code by Phil Sodmann. PyTorch Lightning demo on the Fashion MNIST data. |
2022-02-25 | Experiments and analysis by Lucas Nestler. TPU implementation of SCS. Runtime performance comparison with and without the p parameter. |
2022-02-24 | Code by Dr. John Wagner. Head to head comparison with convnet on American Sign Language alphabet dataset. |
2022-02-22 | Code by Håkon Hukkelås. Reimplementation of SCS in PyTorch with a performance boost from using Conv2D. Achieved 91.3% CIFAR-10 accuracy with a model of 1.2M parameters. |
2022-02-21 | Code by Zimonitrome. An SCS-based GAN, the first of its kind. |
2022-02-20 | Code by Michał Tyszkiewicz. Reimplementation of SCS in PyTorch with a performance boost from using Conv2D. |
2022-02-20 | Code by Lucas Nestler. Reimplementation of SCS in PyTorch with a performance boost and CUDA optimizations. |
2022-02-18 | Blog post by Raphael Pisoni. SOTA parameter efficiency on MNIST. Intuitive feature interpretation. |
2022-02-01 | PyTorch code by Stephen Hogg. PyTorch implementation of SCS. MaxAbsPool implementation. |
2022-02-01 | PyTorch code by Oliver Batchelor. PyTorch implementation of SCS. |
2022-01-31 | PyTorch code by Ze Wang. PyTorch implementation of SCS. |
2022-01-30 | Keras code by Brandon Rohrer. Keras implementation of SCS running on Fashion MNIST. |
2022-01-17 | Code by Raphael Pisoni. Implementation of SCS in paired depthwise/pointwise configuration, the key element of the ConvMixer architecture. |
2022-01-06 | Keras code by Raphael Pisoni. Keras implementation of SCS. |
2020-02-24 | Twitter thread by Brandon Rohrer. Justification and introduction of SCS. |