This repository contains the official PyTorch implementation with accelerated CUDA kernels for our paper:
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
ECCV 2020
Linxi (Jim) Fan*, Shyamal Buch*, Guanzhi Wang, Ryan Cao, Yuke Zhu, Juan Carlos Niebles, Li Fei-Fei
(* denotes equal contribution lead author)
Quick Links: [paper] [project website] [video] [eccv page] [supplementary] [code]
Video action recognition is a complex task dependent on modeling spatial and temporal context. Standard approaches rely on 2D or 3D convolutions to process such context, resulting in expensive operations with millions of parameters. Recent efficient architectures leverage a channel-wise shift-based primitive as a replacement for temporal convolutions, but remain bottlenecked by spatial convolution operations to maintain strong accuracy, and by a fixed-shift scheme. Naively extending such developments to a 3D setting is an intractable goal.
To this end, we introduce RubiksNet, a new efficient architecture for video action recognition based on a proposed learnable 3D spatiotemporal shift operation (RubiksShift). We analyze the suitability of our new primitive for video action recognition and explore several novel variations of our approach to enable stronger representational flexibility while maintaining an efficient design. We benchmark our approach on several standard video recognition datasets, and observe that our method achieves comparable or better accuracy than prior work on efficient video action recognition at a fraction of the computational cost, with 2.9-5.9x fewer parameters and 2.1-3.7x fewer FLOPs. We also perform a series of controlled ablation studies to verify that our significant boost in the efficiency-accuracy tradeoff curve is rooted in the core contributions of our RubiksNet architecture.
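For intuition, the fixed channel-wise temporal shift used by prior efficient architectures (e.g., TSM) can be sketched as below. This is only an illustration of the hard-coded shift primitive referenced above; RubiksShift generalizes it to learnable spatiotemporal shifts with a dedicated CUDA kernel (see the paper for the actual formulation).

```python
import torch

def fixed_temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Fixed channel-wise temporal shift (TSM-style), for illustration only.
    x: tensor of shape (N, T, C, H, W)."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    # shift one slice of channels backward in time
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # shift another slice forward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # leave the remaining channels untouched
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out
```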
Tested with:
- Ubuntu 18.04
- PyTorch >= 1.5
- CUDA 10.1
```bash
# Create your virtual environment
conda create --name rubiksnet python=3.7
conda activate rubiksnet

# Install PyTorch and supporting libraries
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
conda install scikit-learn

# Clone this repo
git clone https://github.com/stanfordvl/rubiksnet.git
cd rubiksnet

# Compiles our efficient CUDA-based RubiksShift operator under the hood
# and installs the main API
pip install -e .
```
To test if the installation is successful, please run `python scripts/test_installation.py`. You should see a random prediction followed by "Installation successful!".
It is very simple to get started with our API:
```python
from rubiksnet.models import RubiksNet

# `tier` must be one of ["tiny", "small", "medium", "large"]
# `variant` must be one of ["rubiks3d", "rubiks3d-aq"]
# 174 is the number of classes for Something-Something-V2 action classification

# instantiate RubiksNet-Tiny with random weights
net = RubiksNet(tier="tiny", num_classes=174, variant="rubiks3d")

# instantiate RubiksNet-Large with temporal attention quantized shift
net = RubiksNet(tier="large", num_classes=174, variant="rubiks3d-aq")

# load RubiksNet-Large model from pretrained weights
net = RubiksNet.load_pretrained("pretrained/ssv2_large.pth.tar")
```
From here, `net` contains a RubiksNet model and can be used like any other PyTorch model! See our inference script for example usage.
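For example, here is a minimal sketch of a forward pass with a dummy clip. The `(batch, frames, channels, height, width)` layout below is an assumption for illustration; check `scripts/test_installation.py` for the exact input format the model expects.

```python
import torch
from rubiksnet.models import RubiksNet

net = RubiksNet(tier="tiny", num_classes=174, variant="rubiks3d")
net.eval()

# Dummy 8-frame clip; the exact layout RubiksNet expects may differ --
# see scripts/test_installation.py for the canonical example.
clip = torch.randn(1, 8, 3, 224, 224)

with torch.no_grad():
    logits = net(clip)              # expected: (1, 174) class scores
print(logits.argmax(dim=-1))        # predicted action class index
```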
For the Something-Something-V2 benchmark, we follow the evaluation convention in TSM and report results from two evaluation protocols. For "1-Clip Val Acc", we sample only a single clip per video and the center 224×224 crop for evaluation. For "2-Clip Val Acc", we sample 2 clips per video and take 3 equally spaced 224×224 crops from the full resolution image scaled to 256 pixels on the shorter side.
| Model | Input (frames) | 2-Clip Top-1 | 2-Clip Top-5 | #Param. | FLOPs | Test Log | Pretrained |
|---|---|---|---|---|---|---|---|
| RubiksNet-Large-AQ (Budget=0.125) | 8 | 61.6 | 86.7 | 8.5M | 15.7G | 1-clip, 2-clip | model link |
| RubiksNet-Large | 8 | 61.7 | 87.3 | 8.5M | 15.8G | 1-clip, 2-clip | model link |
| RubiksNet-Medium | 8 | 60.8 | 86.9 | 6.2M | 11.2G | 1-clip, 2-clip | model link |
| RubiksNet-Small | 8 | 59.8 | 86.2 | 3.6M | 6.8G | 1-clip, 2-clip | model link |
| RubiksNet-Tiny | 8 | 56.7 | 84.1 | 1.9M | 3.9G | 1-clip, 2-clip | model link |
We also provide pretrained models on the Kinetics dataset, which follow the pretraining protocol in prior work -- see the supplementary material for details. All four tiers of RubiksNet can be found at `pretrained/kinetics_{large,medium,small,tiny}.pth.tar`.
Our CUDA implementation includes accelerated gradient calculation on GPUs. We provide an example script to finetune the pretrained Kinetics checkpoints on your own dataset.

```bash
python scripts/example_finetune.py --gpu 0 --pretrained-path pretrained/kinetics_tiny.pth.tar
```
The script contains a dummy dataset that generates random videos. You can replace it with your own data loader. If you run the above script, you should see RubiksNet-tiny gradually overfitting the artificial training data.
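As a hedged sketch of what such a replacement might look like (the class name, clip shape, and label source here are illustrative assumptions, not part of the repo; match whatever `scripts/example_finetune.py` actually expects):

```python
import torch
from torch.utils.data import Dataset

class MyVideoDataset(Dataset):
    """Illustrative replacement for the dummy dataset in the finetune script."""

    def __init__(self, clip_paths, labels, num_frames=8):
        self.clip_paths = clip_paths
        self.labels = labels
        self.num_frames = num_frames

    def __len__(self):
        return len(self.clip_paths)

    def __getitem__(self, idx):
        # Replace this with real frame loading + preprocessing (resize,
        # crop, normalize); here we just return a random (T, C, H, W) tensor.
        clip = torch.randn(self.num_frames, 3, 224, 224)
        label = self.labels[idx]
        return clip, label
```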
With minor modifications, RubiksNet should be largely compatible with video action recognition pipelines in other related repos.
Please refer to the TSM repo for how to prepare the Something-Something-V2 test data. We assume the processed dataset is located at `<root_path>/somethingv2`.
To test "2-Clip Val Acc" of the pretrained SomethingV2 models, you can run
```bash
# test RubiksNet-Large
python test_models.py somethingv2 \
    --root-path=<root_path_to_somethingv2_dataset> \
    --pretrained=pretrained/ssv2_large.pth.tar \
    --two-clips \
    --batch-size=80 -j 8
```
To test "1-Clip Val Acc", you can run
```bash
# test RubiksNet-Large
python test_models.py somethingv2 \
    --root-path=<root_path_to_somethingv2_dataset> \
    --pretrained=pretrained/ssv2_large.pth.tar \
    --batch-size=80 -j 8
```
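For intuition, the multi-clip protocol fuses per-clip (and per-crop) predictions into a single video-level label. The sketch below averages softmax scores over clips; it is an illustration of the protocol, not the repo's actual `test_models.py` implementation.

```python
import torch

def aggregate_clip_scores(clip_logits: torch.Tensor) -> torch.Tensor:
    """clip_logits: (num_clips_x_crops, num_classes), one row per forward pass
    over the same video. Returns the fused video-level prediction."""
    probs = torch.softmax(clip_logits, dim=-1)
    video_probs = probs.mean(dim=0)        # average over clips/crops
    return video_probs.argmax(dim=-1)      # final video-level class index

# e.g. 2 clips x 3 crops = 6 forward passes over a 174-class model
fake_logits = torch.randn(6, 174)
print(aggregate_clip_scores(fake_logits))
```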
If you find this code useful, please cite our ECCV paper:
```bibtex
@inproceedings{fanbuch2020rubiks,
  title={RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition},
  author={Linxi Fan* and Shyamal Buch* and Guanzhi Wang and Ryan Cao and Yuke Zhu and Juan Carlos Niebles and Li Fei-Fei},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}
```
We release our code here under the open MIT License. Our contact information can be found in the paper and on our project website.
This research was sponsored in part by grants from Toyota Research Institute (TRI). Some computational support for experiments was provided by Google Cloud and NVIDIA. The authors also acknowledge fellowship support. Please refer to our paper for full acknowledgements, thank you!
We reference code from the excellent repos of Temporal Segment Network, Temporal Shift Module, ShiftResNet, and ActiveShift. Please be sure to cite these works/repos as well.