[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

Figure: VideoMAE framework overview.

License: CC BY-NC 4.0
Hugging Face Models · Hugging Face Spaces · Colab

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

[2023.4.18] 🎈Everyone can download Kinetics-400, which is used in VideoMAE, from this link.
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available!
[2022.12.27] 🎈Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.
[2022.11.20] 👀 VideoMAE is integrated into Hugging Face Spaces and Colab, supported by @Sayak Paul.
[2022.10.25] 👀 VideoMAE is integrated into MMAction2, and the results on Kinetics-400 can be reproduced successfully.
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available!
[2022.10.19] The pre-trained models and scripts on UCF101 are available!
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉
[2022.8.8] 👀 VideoMAE is now integrated into the official 🤗HuggingFace Transformers! A minimal usage sketch follows this list.
[2022.7.7] We have updated the results on the downstream AVA 2.2 benchmark. Please refer to our paper for details.
[2022.4.24] Code and pre-trained models are available now!
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.
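
As a companion to the Hugging Face integration above, here is a minimal classification sketch against the transformers API. It assumes the MCG-NJU/videomae-base-finetuned-kinetics checkpoint and substitutes a dummy 16-frame clip for real video decoding; treat it as an illustration rather than this repository's own code.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Dummy 16-frame clip; in practice, decode real frames (e.g. with decord or PyAV).
video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"  # Kinetics-400 fine-tuned checkpoint
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")  # pixel_values: (1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(**inputs).logits             # (1, 400) scores over Kinetics-400 classes
print(model.config.id2label[logits.argmax(-1).item()])
```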

Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training; a minimal sketch of the masking idea is shown below.
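
The following is an illustrative sketch of tube masking, not the repository's exact implementation (function names and tensor layout are assumptions). The core idea: one spatial mask is sampled per clip and repeated over all temporal patches, so masked regions form space-time "tubes" that prevent the model from copying content across neighboring frames.

```python
import torch

def tube_mask(batch_size: int, t_patches: int, s_patches: int,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (B, t_patches, s_patches); True = masked.

    The same spatial pattern is repeated over time, forming "tubes".
    """
    num_masked = int(mask_ratio * s_patches)
    masks = []
    for _ in range(batch_size):
        order = torch.randperm(s_patches)       # random spatial permutation
        spatial = torch.zeros(s_patches, dtype=torch.bool)
        spatial[order[:num_masked]] = True      # mask ~90% of spatial patches
        masks.append(spatial.unsqueeze(0).expand(t_patches, -1))
    return torch.stack(masks)

# 16 frames with tubelet size 2 -> 8 temporal patches; 224/16 = 14 -> 14x14 spatial patches
m = tube_mask(batch_size=2, t_patches=8, s_patches=14 * 14)
print(m.shape, m.float().mean().item())  # torch.Size([2, 8, 196]), ~0.9
```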

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Because of the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training; the sketch below shows where the saving comes from.
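
A hedged sketch of the asymmetric design: the ViT encoder attends only over the ~10% visible tokens, which is where most of the pre-training speedup comes from. The dimensions and the tiny two-layer encoder here are placeholders for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

B, N, C = 2, 1568, 768             # 1568 tokens = 8 temporal x 14x14 spatial patches
mask_ratio = 0.9
num_visible = int(N * (1 - mask_ratio))

tokens = torch.randn(B, N, C)      # patch embeddings + positions (placeholder)
order = torch.rand(B, N).argsort(dim=1)
vis_ids = order[:, :num_visible]   # indices of the ~10% visible tokens

# The encoder runs on visible tokens only; self-attention cost drops roughly
# quadratically with the token count, hence the large pre-training speedup.
layer = nn.TransformerEncoderLayer(d_model=C, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

visible = torch.gather(tokens, 1, vis_ids.unsqueeze(-1).expand(-1, -1, C))
latent = encoder(visible)          # (B, num_visible, C); a lightweight decoder
print(latent.shape)                # would then reconstruct the masked tubes
```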

😮 High performance, but NO extra data required

VideoMAE works well on video datasets of different scales and achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first method to achieve state-of-the-art performance on these four popular benchmarks with a vanilla ViT backbone, without requiring any extra data or pre-trained models.

🚀 Main Results

Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |

Kinetics-400

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |

AVA 2.2

Please check the code and checkpoints in VideoMAE-Action-Detection.

| Method | Extra Data | Extra Label | Backbone | #Frames x Sample Rate | mAP |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |

UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide the script for visualization in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}

More Repositories

1. MixFormer — [CVPR 2022 Oral & TPAMI 2024] MixFormer: End-to-End Tracking with Iterative Mixed Attention (Python, 445 stars)
2. TDN — [CVPR 2021] TDN: Temporal Difference Networks for Efficient Action Recognition (Python, 366 stars)
3. EMA-VFI — [CVPR 2023] Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation (Python, 339 stars)
4. SparseBEV — [ICCV 2023] SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos (Python, 328 stars)
5. MOC-Detector — [ECCV 2020] Actions as Moving Points (Python, 264 stars)
6. AdaMixer — [CVPR 2022 Oral] AdaMixer: A Fast-Converging Query-Based Object Detector (Jupyter Notebook, 236 stars)
7. CamLiFlow — [CVPR 2022 Oral & TPAMI 2023] Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion (Python, 216 stars)
8. SparseOcc — [ECCV 2024] Fully Sparse 3D Occupancy Prediction & RayIoU Evaluation Metric (Python, 199 stars)
9. MeMOTR — [ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking (Python, 141 stars)
10. MixFormerV2 — [NeurIPS 2023] MixFormerV2: Efficient Fully Transformer Tracking (Python, 136 stars)
11. SportsMOT — [ICCV 2023] SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes (Python, 133 stars)
12. SADRNet — [TIP 2021] SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction (Python, 126 stars)
13. MultiSports — [ICCV 2021] MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions (Python, 106 stars)
14. FCOT — [CVIU] Fully Convolutional Online Tracking (Python, 91 stars)
15. MMN — [AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (Python, 88 stars)
16. RTD-Action — [ICCV 2021] Relaxed Transformer Decoders for Direct Action Proposal Generation (Python, 86 stars)
17. MOTIP — Multiple Object Tracking as ID Prediction (Python, 84 stars)
18. BCN — [ECCV 2020] Boundary-Aware Cascade Networks for Temporal Action Segmentation (Python, 84 stars)
19. LinK — [CVPR 2023] LinK: Linear Kernel for LiDAR-based 3D Perception (Python, 81 stars)
20. MixSort — [ICCV 2023] MixSort: The Customized Tracker in SportsMOT (Python, 69 stars)
21. CPD-Video — Learning Spatiotemporal Features via Video and Text Pair Discrimination (Python, 60 stars)
22. SGM-VFI — [CVPR 2024] Sparse Global Matching for Video Frame Interpolation with Large Motion (Python, 59 stars)
23. Structured-Sparse-RCNN — [CVPR 2022] Structured Sparse R-CNN for Direct Scene Graph Generation (Jupyter Notebook, 57 stars)
24. TRACE — [ICCV 2021] Target Adaptive Context Aggregation for Video Scene Graph Generation (Python, 57 stars)
25. CRCNN-Action — Context-aware RCNN: a Baseline for Action Detection in Videos (Python, 53 stars)
26. STMixer — [CVPR 2023] STMixer: A One-Stage Sparse Action Detector (Python, 49 stars)
27. BasicTAD — BasicTAD: An Astounding RGB-Only Baseline for Temporal Action Detection (Python, 48 stars)
28. DDM — [CVPR 2022] Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection (Python, 48 stars)
29. VideoMAE-Action-Detection — [NeurIPS 2022 Spotlight] VideoMAE for Action Detection (Python, 47 stars)
30. MGSampler — [ICCV 2021] MGSampler: An Explainable Sampling Strategy for Video Action Recognition (Python, 46 stars)
31. FSL-Video — [BMVC 2021] A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark (Python, 39 stars)
32. BIVDiff — [CVPR 2024] BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models (Python, 39 stars)
33. PointTAD — [NeurIPS 2022] PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points (Python, 37 stars)
34. TemporalPerceiver — [T-PAMI 2023] Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection (Python, 34 stars)
35. TIA — [CVPR 2022] Task-specific Inconsistency Alignment for Domain Adaptive Object Detection (Python, 33 stars)
36. CoMAE — [AAAI 2023] CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets (Python, 31 stars)
37. PDPP — [CVPR 2023 Highlight] PDPP: Projected Diffusion for Procedure Planning in Instructional Videos (Python, 27 stars)
38. JoMoLD — [ECCV 2022] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing (Python, 27 stars)
39. EVAD — [ICCV 2023] Efficient Video Action Detection with Token Dropout and Context Refinement (Python, 24 stars)
40. CGA-Net — [CVPR 2021] CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation (Python, 23 stars)
41. SSD-LT — [ICCV 2021] Self Supervision to Distillation for Long-Tailed Visual Recognition (Python, 22 stars)
42. TREG — Target Transformed Regression for Accurate Tracking (Python, 21 stars)
43. VFIMamba — VFIMamba: Video Frame Interpolation with State Space Models (Python, 21 stars)
44. DEQDet — [ICCV 2023] Deep Equilibrium Object Detection (Jupyter Notebook, 20 stars)
45. MGMAE — [ICCV 2023] MGMAE: Motion Guided Masking for Video Masked Autoencoding (Python, 19 stars)
46. OCSampler — [CVPR 2022] OCSampler: Compressing Videos to One Clip with Single-step Sampling (Python, 17 stars)
47. SportsHHI — [CVPR 2024] SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos (Python, 11 stars)
48. APP-Net — [TIP] APP-Net: Auxiliary-point-based Push and Pull Operations for Efficient Point Cloud Recognition (Python, 11 stars)
49. AMD — [CVPR 2024] Asymmetric Masked Distillation for Pre-Training Small Foundation Models (Python, 11 stars)
50. StageInteractor — [ICCV 2023] StageInteractor: Query-based Object Detector with Cross-stage Interaction (Python, 9 stars)
51. SPLAM — [ECCV 2024 Oral] SPLAM: Accelerating Image Generation with Sub-path Linear Approximation Model (Python, 9 stars)
52. CMPT — [IJCV 2021] Cross-Modal Pyramid Translation for RGB-D Scene Recognition (Python, 8 stars)
53. VLG — VLG: General Video Recognition with Web Textual Knowledge (https://arxiv.org/abs/2212.01638) (Python, 8 stars)
54. DGN — [IJCV 2023] Dual Graph Networks for Pose Estimation in Crowded Scenes (Python, 7 stars)
55. Dynamic-MDETR — [TPAMI 2024] Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding (Python, 7 stars)
56. BFRNet (Python, 6 stars)
57. ViT-TAD — [CVPR 2024] Adapting Short-Term Transformers for Action Detection in Untrimmed Videos (Python, 6 stars)
58. VideoEval — VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Models (Python, 6 stars)
59. ZeroI2V — [ECCV 2024] ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video (Python, 5 stars)
60. PRVG — [CVIU 2024] End-to-End Dense Video Grounding via Parallel Regression (Python, 5 stars)
61. LogN — [IJCV 2024] Logit Normalization for Long-Tail Object Detection (Python, 4 stars)