RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

This repository contains the official PyTorch implementation with accelerated CUDA kernels for our paper:

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
ECCV 2020
Linxi (Jim) Fan*, Shyamal Buch*, Guanzhi Wang, Ryan Cao, Yuke Zhu, Juan Carlos Niebles, Li Fei-Fei
(* denotes equal contribution lead author)

Quick Links: [paper] [project website] [video] [eccv page] [supplementary] [code]

Abstract

[Figure: RubiksNet framework overview]

Video action recognition is a complex task dependent on modeling spatial and temporal context. Standard approaches rely on 2D or 3D convolutions to process such context, resulting in expensive operations with millions of parameters. Recent efficient architectures leverage a channel-wise shift-based primitive as a replacement for temporal convolutions, but remain bottlenecked by spatial convolution operations to maintain strong accuracy and a fixed-shift scheme. Naively extending such developments to a 3D setting is a difficult, intractable goal.

To this end, we introduce RubiksNet, a new efficient architecture for video action recognition based on a proposed learnable 3D spatiotemporal shift operation (RubiksShift). We analyze the suitability of our new primitive for video action recognition and explore several novel variations of our approach to enable stronger representational flexibility while maintaining an efficient design. We benchmark our approach on several standard video recognition datasets, and observe that our method achieves comparable or better accuracy than prior work on efficient video action recognition at a fraction of the performance cost, with 2.9 - 5.9x fewer parameters and 2.1 - 3.7x fewer FLOPs. We also perform a series of controlled ablation studies to verify our significant boost in the efficiency-accuracy tradeoff curve is rooted in the core contributions of our RubiksNet architecture.
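The core idea behind a learnable shift can be pictured with a short sketch: a per-channel shift whose fractional offsets are themselves learned, made differentiable through linear interpolation. The snippet below is purely illustrative and only shifts along the temporal axis in plain PyTorch; it is not the official RubiksShift operator, which learns shifts along all three spatiotemporal axes and runs as a fused CUDA kernel. The class name and initialization are assumptions.

import torch
import torch.nn as nn

class LearnableTemporalShift(nn.Module):
    # Illustrative sketch only: shift each channel along time by a learnable
    # fractional offset, realized with linear interpolation so that the
    # offsets receive gradients.
    def __init__(self, channels):
        super().__init__()
        # one learnable shift per channel, initialized near zero
        self.shift = nn.Parameter(0.01 * torch.randn(channels))

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        floor = torch.floor(self.shift)                     # integer part, (c,)
        frac = (self.shift - floor).view(1, c, 1, 1, 1)     # fractional part
        idx = torch.arange(t, device=x.device).view(1, 1, t)
        # indices of the two frames to blend, clamped at the clip boundaries
        i0 = (idx + floor.view(1, c, 1)).clamp(0, t - 1).long()
        i1 = (i0 + 1).clamp(0, t - 1)
        i0 = i0.expand(b, c, t)[..., None, None].expand(b, c, t, h, w)
        i1 = i1.expand(b, c, t)[..., None, None].expand(b, c, t, h, w)
        x0 = torch.gather(x, 2, i0)
        x1 = torch.gather(x, 2, i1)
        # gradients reach self.shift through the interpolation weights
        return (1 - frac) * x0 + frac * x1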

Installation

Tested with:

  • Ubuntu 18.04
  • PyTorch >= 1.5
  • CUDA 10.1
# Create your virtual environment
conda create --name rubiksnet python=3.7
conda activate rubiksnet

# Install PyTorch and supporting libraries
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
conda install scikit-learn

# Clone this repo
git clone https://github.com/stanfordvl/rubiksnet.git
cd rubiksnet

# Compiles our efficient CUDA-based RubiksShift operator under the hood
# and installs the main API
pip install -e .

To test if the installation is successful, please run python scripts/test_installation.py. You should see a random prediction followed by "Installation successful!".

Usage

It is very simple to get started with our API:

from rubiksnet.models import RubiksNet

# `tier` must be one of ["tiny", "small", "medium", "large"]
# `variant` must be one of ["rubiks3d", "rubiks3d-aq"]
# 174 is the number of classes for Something-Something-V2 action classification

# instantiate RubiksNet-Tiny with random weights
net = RubiksNet(tier="tiny", num_classes=174, variant="rubiks3d")

# instantiate RubiksNet-Large network with temporal attention quantized shift 
net = RubiksNet(tier="large", num_classes=174, variant="rubiks3d-aq")

# load RubiksNet-Large model from pretrained weights
net = RubiksNet.load_pretrained("pretrained/ssv2_large.pth.tar")

From here, net contains a RubiksNet model and can be used like any other PyTorch model! See our inference script for example usage.
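As a rough sketch of what inference might look like (the dummy input layout below is an assumption made for illustration; the inference script under scripts/ is the authoritative reference for preprocessing):

import torch
from rubiksnet.models import RubiksNet

# load the pretrained SSv2 model and switch to inference mode
net = RubiksNet.load_pretrained("pretrained/ssv2_large.pth.tar")
net.eval().cuda()

# hypothetical dummy clip: 8 RGB frames folded into the channel dimension at
# 224x224 resolution -- the exact layout the model expects is an assumption
clip = torch.randn(1, 8 * 3, 224, 224, device="cuda")

with torch.no_grad():
    logits = net(clip)                    # expected shape: (1, num_classes)
    pred = logits.argmax(dim=-1).item()
print("Predicted class index:", pred)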

Pretrained Models

Something-Something-V2

For the Something-Something-V2 benchmark, we follow the evaluation convention in TSM and report results from two evaluation protocols. For "1-Clip Val Acc", we sample only a single clip per video and the center 224×224 crop for evaluation. For "2-Clip Val Acc", we sample 2 clips per video and take 3 equally spaced 224×224 crops from the full resolution image scaled to 256 pixels on the shorter side.
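In pseudocode, the 2-clip protocol boils down to averaging class scores over all clip/crop combinations before taking the arg-max. The sketch below is illustrative only: sample_clip and three_crops are hypothetical helpers standing in for the sampling and cropping described above, not functions from this repository, and the exact averaging in the test script may differ.

import torch

def two_clip_score(net, video, sample_clip, three_crops):
    scores = []
    for clip_idx in range(2):                       # 2 temporally sampled clips
        clip = sample_clip(video, clip_idx)         # shorter side scaled to 256
        for crop in three_crops(clip):              # 3 equally spaced 224x224 crops
            with torch.no_grad():
                scores.append(torch.softmax(net(crop.unsqueeze(0)), dim=-1))
    return torch.stack(scores).mean(dim=0)          # averaged class probabilities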

| Model | Input | 2-Clip Top-1 | 2-Clip Top-5 | #Param. | FLOPs | Test Log | Pretrained |
|---|---|---|---|---|---|---|---|
| RubiksNet-Large-AQ (Budget=0.125) | 8 | 61.6 | 86.7 | 8.5M | 15.7G | 1-clip / 2-clip | model link |
| RubiksNet-Large | 8 | 61.7 | 87.3 | 8.5M | 15.8G | 1-clip / 2-clip | model link |
| RubiksNet-Medium | 8 | 60.8 | 86.9 | 6.2M | 11.2G | 1-clip / 2-clip | model link |
| RubiksNet-Small | 8 | 59.8 | 86.2 | 3.6M | 6.8G | 1-clip / 2-clip | model link |
| RubiksNet-Tiny | 8 | 56.7 | 84.1 | 1.9M | 3.9G | 1-clip / 2-clip | model link |

Kinetics

We also provide pretrained models on the Kinetics dataset, which follow the pretraining protocol in prior work -- see the supplementary material for details. All four tiers of RubiksNet can be found at pretrained/kinetics_{large,medium,small,tiny}.pth.tar.

Our CUDA implementation includes accelerated gradient calculation on GPUs. We provide an example script to finetune the pretrained Kinetics checkpoints on your own dataset.

python scripts/example_finetune.py --gpu 0 --pretrained-path pretrained/kinetics_tiny.pth.tar

The script contains a dummy dataset that generates random videos. You can replace it with your own data loader. If you run the above script, you should see RubiksNet-tiny gradually overfitting the artificial training data.
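If you would rather wire the finetuning into your own code, a rough sketch might look like the following; the dataset class, clip layout, and hyperparameters are illustrative assumptions, and scripts/example_finetune.py (including its handling of the classifier head) remains the reference.

import torch
from torch.utils.data import DataLoader, Dataset
from rubiksnet.models import RubiksNet

class RandomClips(Dataset):
    # stand-in for your own data, mirroring the dummy dataset idea in
    # scripts/example_finetune.py; the (8 * 3, 224, 224) clip layout and the
    # label range are assumptions made purely for illustration
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(8 * 3, 224, 224), torch.randint(0, 10, ()).item()

# load a Kinetics-pretrained checkpoint; remapping the classifier head to your
# own number of classes is handled inside scripts/example_finetune.py and is
# not reproduced here
net = RubiksNet.load_pretrained("pretrained/kinetics_tiny.pth.tar").cuda()

loader = DataLoader(RandomClips(), batch_size=8, shuffle=True, num_workers=4)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

net.train()
for epoch in range(5):
    for clips, labels in loader:
        clips, labels = clips.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(net(clips), labels)
        loss.backward()                   # shift gradients come from the CUDA kernels
        optimizer.step()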

With minor modifications, RubiksNet should be largely compatible with video action recognition pipelines in other related repos.

Testing

Please refer to the TSM repo for how to prepare the Something-Something-V2 test data. We assume the processed dataset is located at <root_path>/somethingv2.

To test "2-Clip Val Acc" of the pretrained SomethingV2 models, you can run

# test RubiksNet-Large
python test_models.py somethingv2 \
	--root-path=<root_path_to_somethingv2_dataset> \
	--pretrained=pretrained/ssv2_large.pth.tar \
	--two-clips \
	--batch-size=80 -j 8

To test "1-Clip Val Acc", you can run

# test RubiksNet-Large
python test_models.py somethingv2 \
	--root-path=<root_path_to_somethingv2_dataset> \
	--pretrained=pretrained/ssv2_large.pth.tar \
	--batch-size=80 -j 8 

Citation

If you find this code useful, please cite our ECCV paper:

@inproceedings{fanbuch2020rubiks,
  title={RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition},
  author={Linxi Fan* and Shyamal Buch* and Guanzhi Wang and Ryan Cao and Yuke Zhu and Juan Carlos Niebles and Li Fei-Fei},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}

LICENSE

We release our code under the MIT License. Our contact information can be found in the paper and on our project website.

Acknowledgements

This research was sponsored in part by grants from Toyota Research Institute (TRI). Some computational support for experiments was provided by Google Cloud and NVIDIA. The authors also acknowledge fellowship support. Please refer to our paper for full acknowledgements, thank you!

We reference code from the excellent repos of Temporal Segment Network, Temporal Shift Module, ShiftResNet, and ActiveShift. Please be sure to cite these works/repos as well.
