Implementing Stand-Alone Self-Attention in Vision Models using Pytorch (13 Jun 2019)

  • Stand-Alone Self-Attention in Vision Models paper
  • Authors:
    • Prajit Ramachandran (Google Research, Brain Team)
    • Niki Parmar (Google Research, Brain Team)
    • Ashish Vaswani (Google Research, Brain Team)
    • Irwan Bello (Google Research, Brain Team)
    • Anselm Levskaya (Google Research, Brain Team)
    • Jonathon Shlens (Google Research, Brain Team)
  • Awesome :)

Method

  • Attention Layer

    • Equation 1:

      $$y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{softmax}_{ab}\left(q_{ij}^\top k_{ab}\right) v_{ab}$$
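Equation 1 can be sketched as follows. This is a hypothetical, loop-based NumPy illustration of local attention, not the repository's batched PyTorch implementation; the function name `local_attention` and its argument layout are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(q, k, v, extent=3):
    """Equation 1: each pixel attends over its extent x extent neighborhood.

    q, k, v: (H, W, d) per-pixel query/key/value maps.
    """
    H, W, d = q.shape
    pad = extent // 2
    kp = np.pad(k, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad borders
    vp = np.pad(v, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(q)
    for i in range(H):
        for j in range(W):
            # neighborhood N_k(i, j): keys/values in the local window
            kn = kp[i:i + extent, j:j + extent].reshape(-1, d)
            vn = vp[i:i + extent, j:j + extent].reshape(-1, d)
            logits = kn @ q[i, j]             # q_ij^T k_ab for each (a, b)
            out[i, j] = softmax(logits) @ vn  # softmax-weighted sum of v_ab
    return out
```

The output keeps the input's spatial resolution, so the layer can stand in for a stride-1 convolution.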

  • Relative Position Embedding

    • The row and column offsets are associated with embeddings $r_{a-i}$ and $r_{b-j}$ respectively, each with dimension $\frac{1}{2} d_{out}$. The row and column offset embeddings are concatenated to form $r_{a-i,\,b-j}$. The spatial-relative attention is then defined by the equation below.

    • Equation 2:

      $$y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{softmax}_{ab}\left(q_{ij}^\top k_{ab} + q_{ij}^\top r_{a-i,\,b-j}\right) v_{ab}$$

    • I referred to the following paper when implementing this part.
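The factorized offset embeddings in Equation 2 can be sketched like this. A minimal NumPy illustration under stated assumptions: `relative_embeddings` is a name invented here, and the tables are randomly initialized stand-ins for learned parameters.

```python
import numpy as np

def relative_embeddings(extent, d_out, seed=0):
    """Build r_{a-i, b-j}: concatenation of row and column offset embeddings.

    Each offset table has dimension d_out // 2, so the concatenated
    embedding matches the key dimension d_out.
    """
    rng = np.random.default_rng(seed)
    row = rng.normal(size=(extent, d_out // 2))  # r_{a-i}, one per row offset
    col = rng.normal(size=(extent, d_out // 2))  # r_{b-j}, one per col offset
    # r[dy, dx] = concat(row[dy], col[dx]) -> shape (extent, extent, d_out)
    r = np.concatenate(
        [np.broadcast_to(row[:, None], (extent, extent, d_out // 2)),
         np.broadcast_to(col[None, :], (extent, extent, d_out // 2))],
        axis=-1)
    return r
```

The factorization learns only 2 · extent · (d_out / 2) parameters instead of a full extent² · d_out table.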

  1. Replacing Spatial Convolutions
    - This work applies the transform to the ResNet family of architectures: the proposed transform swaps the 3 × 3 spatial convolution for a self-attention layer as defined in Equation 2.
    - A 2 × 2 average pooling with stride 2 operation follows the attention layer whenever spatial downsampling is required.
  2. Replacing the Convolutional Stem
    - The initial layers of a CNN, sometimes referred to as the stem, play a critical role in learning local features such as edges, which later layers use to identify global objects.
    - The stem performs self-attention within each 4 × 4 spatial block of the original image, followed by batch normalization and a 4 × 4 max pool operation.
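The 2 × 2 stride-2 average pooling used for downsampling in point 1 reduces to a reshape-and-mean; a NumPy sketch, with `avg_pool_2x2` as an assumed name:

```python
import numpy as np

def avg_pool_2x2(x):
    """2 x 2 average pooling with stride 2 over an (H, W, d) feature map.

    Applied after the attention layer whenever spatial downsampling is
    required; odd trailing rows/columns are dropped.
    """
    H, W, d = x.shape
    x = x[:H // 2 * 2, :W // 2 * 2]  # crop to even spatial size
    return x.reshape(H // 2, 2, W // 2, 2, d).mean(axis=(1, 3))
```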

Experiments

Setup

  • Spatial extent: 7
  • Attention heads: 8
  • Layers:
    • ResNet 26: [1, 2, 4, 1]
    • ResNet 38: [2, 3, 5, 2]
    • ResNet 50: [3, 4, 6, 3]
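The block counts above are consistent with the usual ResNet depth arithmetic, assuming bottleneck blocks (3 layers each) plus a stem layer and the final classifier; a quick sanity check:

```python
# Block counts per stage, as listed in the Setup above
resnet_layers = {26: [1, 2, 4, 1], 38: [2, 3, 5, 2], 50: [3, 4, 6, 3]}

# depth = 3 * (total bottleneck blocks) + stem + classifier
for depth, blocks in resnet_layers.items():
    assert 3 * sum(blocks) + 2 == depth
```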

| Datasets | Model | Accuracy | Parameters (My Model, Paper Model) |
| -------- | ----- | -------- | ---------------------------------- |
| CIFAR-10 | ResNet 26 | 90.94% | 8.30M, - |
| CIFAR-10 | Naive ResNet 26 | 94.29% | 8.74M |
| CIFAR-10 | ResNet 26 + stem | 90.22% | 8.30M, - |
| CIFAR-10 | ResNet 38 (WORK IN PROGRESS) | 89.46% | 12.1M, - |
| CIFAR-10 | Naive ResNet 38 | 94.93% | 15.0M |
| CIFAR-10 | ResNet 50 (WORK IN PROGRESS) | - | 16.0M, - |
| IMAGENET | ResNet 26 (WORK IN PROGRESS) | - | 10.3M, 10.3M |
| IMAGENET | ResNet 38 (WORK IN PROGRESS) | - | 14.1M, 14.1M |
| IMAGENET | ResNet 50 (WORK IN PROGRESS) | - | 18.0M, 18.0M |

Usage

Requirements

  • torch==1.0.1

Todo

  • Experiments
  • IMAGENET
  • Review relative position embedding, attention stem
  • Code Refactoring

Reference
