• Stars: 219
• Rank: 180,080 (Top 4 %)
• Language: Python
• License: Apache License 2.0
• Created over 3 years ago
• Updated about 2 years ago

Repository Details

Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

An Image is Worth 16x16 Words, What is a Video Worth?

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method.
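To make the idea of covering a whole video with only a few frames concrete, here is a minimal sketch of evenly spaced frame selection. It assumes frames are picked at the centre of equal-length segments; the function name `uniform_frame_indices` and the 300-frame example are hypothetical and are not taken from this repository, whose actual sampling code may differ.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return num_samples frame indices spread evenly over a video.

    Assumption (not from the repository): the video is split into
    num_samples equal segments and the centre frame of each is taken.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    # Clamp to the last valid index in case of rounding at the end.
    return [min(total_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


# Hypothetical 300-frame video reduced to 16 frames, as for STAM-16.
indices = uniform_frame_indices(300, 16)
print(indices)
```

With this kind of sampling a single forward pass sees frames from the whole video, in contrast to the multi-clip scheme the abstract describes for 3D-convolution models.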

Update 2/5/2021: Improved results

With improved training hyperparameters and knowledge distillation (KD) training, we were able to improve STAM results on Kinetics400 (+ ~1.5%). We are releasing the pretrained weights of the improved models (see Pretrained Models below).

Main Article Results

Accuracy and GPU throughput of STAM models on Kinetics400, compared to X3D. All measurements were taken on an Nvidia V100 GPU with mixed precision. All models were trained at an input resolution of 224.

| Models | Top-1 Accuracy (%) | Flops × views (10^9) | # Input Frames | Runtime (Videos/sec) |
|---|---|---|---|---|
| X3D-M | 76.0 | 6.2 × 30 | 480 | 1.3 |
| X3D-L | 77.5 | 24.8 × 30 | 480 | 0.46 |
| X3D-XL | 79.1 | 48.4 × 30 | 480 | N/A |
| X3D-XXL | 80.4 | 194 × 30 | 480 | N/A |
| TimeSformer-L | 80.7 | 2380 × 3 | 288 | N/A |
| ViViT-L | 81.3 | 3992 × 12 | 384 | N/A |
| STAM-8 | 77.5 | 135 × 1 | 8 | --- |
| STAM-16 | 79.3 | 270 × 1 | 16 | 20.0 |
| STAM-32 | 79.95 | 540 × 1 | 32 | --- |
| STAM-64 | 80.5 | 1080 × 1 | 64 | 4.8 |
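The "Flops × views" column can be expanded into a total cost per video. The short sketch below does that arithmetic, assuming the column reads as GFLOPs per view multiplied by the number of views; the variable and function names are illustrative only, and the numbers are copied from the table.

```python
# (model, GFLOPs per view, views, input frames), copied from the table above.
ROWS = [
    ("X3D-M", 6.2, 30, 480),
    ("X3D-L", 24.8, 30, 480),
    ("X3D-XL", 48.4, 30, 480),
    ("X3D-XXL", 194, 30, 480),
    ("TimeSformer-L", 2380, 3, 288),
    ("ViViT-L", 3992, 12, 384),
    ("STAM-8", 135, 1, 8),
    ("STAM-16", 270, 1, 16),
    ("STAM-32", 540, 1, 32),
    ("STAM-64", 1080, 1, 64),
]


def summarize(rows):
    """Map each model to its total GFLOPs per video and its frame count."""
    return {
        name: {"total_gflops": flops * views, "frames": frames}
        for name, flops, views, frames in rows
    }


summary = summarize(ROWS)
for name, stats in summary.items():
    print(f"{name:14s} {stats['total_gflops']:>9.1f} GFLOPs  "
          f"{stats['frames']:>3d} frames")

# Ratio of input frames between an X3D model and STAM-16.
frame_ratio = summary["X3D-XL"]["frames"] / summary["STAM-16"]["frames"]
print(f"X3D-XL uses {frame_ratio:.0f}x the frames of STAM-16")
```

The frame ratio of 480 to 16 is in line with the "×30" reduction in frames per video quoted in the abstract.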

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

| Model name | Checkpoint |
|---|---|
| STAM_8 | link |
| STAM_16 | link |
| STAM_32 | link |
| STAM_64 | link |

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16 \
--input_size=224
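For reference, the validation top-1 score is the percentage of videos whose highest-scoring class matches the ground-truth label. The sketch below illustrates that metric on toy data in plain Python; it is not the code in infer.py, and the function name `top1_accuracy` and the sample scores are made up for illustration.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the label.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 samples, 3 classes (values are illustrative only).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
toy_labels = [1, 0, 1, 1]
accuracy = top1_accuracy(toy_scores, toy_labels)
print(f"top-1 accuracy: {accuracy:.1f}%")
```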

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.
