• Stars: 260
• Rank: 152,063 (Top 4%)
• Language: Python
• License: Apache License 2.0
• Created: over 1 year ago
• Updated: 6 months ago

Repository Details

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

UniFormerV2

This repo is the official implementation of "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", by Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao.

Update

07/14/2023

UniFormerV2 has been accepted by ICCV2023! 🎉

02/13/2023

UniFormerV2 has been integrated into MMAction2. Training code will be provided soon! 😄

11/20/2022

We provide a video demo on Hugging Face. Have a try! 😄

11/19/2022

We provide a blog post (in Chinese) on Zhihu.

11/18/2022

All the code, models, and configs are provided. Don't hesitate to open an issue if you run into any problems! 🙋🏻

Introduction

In UniFormerV2, we propose a generic paradigm for building a powerful family of video networks by arming pre-trained ViTs with efficient UniFormer designs. It inherits the concise style of the UniFormer block but introduces brand-new local and global relation aggregators, which achieve a preferable accuracy-computation balance by seamlessly integrating the advantages of both ViTs and UniFormer. UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400.
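
To make the design above concrete, here is a minimal PyTorch sketch of the two aggregators, written for this summary rather than taken from the official code: module names, shapes, and hyperparameters are illustrative assumptions. The local relation aggregator is rendered as a depth-wise temporal convolution over per-frame ViT features, and the global relation aggregator as a learnable video query that cross-attends over all spatiotemporal tokens. See the repository for the real designs.

import torch
import torch.nn as nn

class LocalRelationAggregator(nn.Module):
    # Local aggregator (sketch): depth-wise 3D convolution over the time
    # axis, added residually to the per-frame ViT features.
    def __init__(self, dim, t_kernel=3):
        super().__init__()
        self.norm = nn.BatchNorm3d(dim)
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=(t_kernel, 1, 1),
                                padding=(t_kernel // 2, 0, 0), groups=dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        return x + self.dwconv(self.norm(x))

class GlobalRelationAggregator(nn.Module):
    # Global aggregator (sketch): a learnable video query attends over all
    # spatiotemporal tokens via cross-attention.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (B, N, C), N = T*H*W
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, self.norm(tokens), self.norm(tokens))
        return out.squeeze(1)  # (B, C) video-level representation

# Usage sketch: fuse frame features locally in time, then pool globally.
feats = torch.randn(2, 768, 8, 14, 14)              # (B, C, T, H, W)
fused = LocalRelationAggregator(768)(feats)         # same shape, temporally fused
tokens = fused.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
video_repr = GlobalRelationAggregator(768)(tokens)  # (B, 768)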


Model Zoo

All the models can be found in MODEL_ZOO.

Instructions

See INSTRUCTIONS for more details about:

  • Environment installation
  • Dataset preparation
  • Training and validation

Cite UniFormer

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{li2022uniformerv2,
      title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer}, 
      author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Yu Qiao},
      year={2022},
      eprint={2211.09552},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Acknowledgement

This repository is built on the UniFormer and SlowFast repositories.

More Repositories

1. LLaMA-Adapter: [ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters (Python, 5,526 stars)
2. DragGAN: Unofficial implementation of "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured implementation with an online demo and local deployment; code and models fully open-sourced; supports Windows, macOS, and Linux) (Python, 4,967 stars)
3. InternGPT: InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, and more. Try it at igpt.opengvlab.com (Python, 3,123 stars)
4. Ask-Anything: [CVPR2024 Highlight] [VideoChatGPT] ChatGPT with video understanding, plus many more supported LMs such as miniGPT4, StableLM, and MOSS (Python, 2,695 stars)
5. InternImage: [CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (Python, 2,315 stars)
6. InternVideo: Video Foundation Models & Data for Multimodal Understanding (Python, 954 stars)
7. InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V (Python, 936 stars)
8. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (554 stars)
9. OmniQuant: [ICLR2024 spotlight] A simple and powerful quantization technique for LLMs (Python, 549 stars)
10. VideoMamba: State Space Model for Efficient Video Understanding (Python, 506 stars)
11. GITM: Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory (445 stars)
12. VideoMAEv2: [CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking (Python, 375 stars)
13. Multi-Modality-Arena: Chatbot Arena meets multi-modality! Benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more (Python, 374 stars)
14. all-seeing: [ICLR 2024] Official implementation of "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World" (Python, 373 stars)
15. CaFo: [CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners (Python, 323 stars)
16. PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm (Python, 302 stars)
17. DCNv4: [CVPR 2024] Deformable Convolution v4 (Python, 269 stars)
18. LAMM: [NeurIPS 2023 Datasets and Benchmarks Track] Multi-Modal Large Language Models and Applications as AI Agents (Python, 267 stars)
19. Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (Python, 223 stars)
20. unmasked_teacher: [ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models (Python, 220 stars)
21. Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures (Python, 216 stars)
22. HumanBench: Official implementation of HumanBench (CVPR 2023) (Python, 207 stars)
23. gv-benchmark: General Vision Benchmark (GV-B), a project from OpenGVLab (Python, 187 stars)
24. InternVideo2 (152 stars)
25. ControlLLM: Augment Language Models with Tools by Searching on Graphs (Python, 148 stars)
26. UniHCP: Official PyTorch implementation of UniHCP (Python, 137 stars)
27. efficient-video-recognition (Python, 114 stars)
28. SAM-Med2D: Official implementation of SAM-Med2D (Jupyter Notebook, 114 stars)
29. ego4d-eccv2022-solutions: Champion solutions for the Ego4D Challenge at ECCV 2022 (Jupyter Notebook, 77 stars)
30. Awesome-DragGAN: A curated list of papers, tutorials, and repositories related to DragGAN (75 stars)
31. DiffRate: [ICCV 23] An approach to enhance the efficiency of the Vision Transformer (ViT) by concurrently employing token pruning and token merging, while incorporating a differentiable compression rate (Jupyter Notebook, 72 stars)
32. STM-Evaluation (Python, 69 stars)
33. M3I-Pretraining (69 stars)
34. DDPS: Official implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling" (Python, 53 stars)
35. MUTR: [AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation (Python, 52 stars)
36. Awesome-LLM4Tool: A curated list of papers, repositories, tutorials, and anything related to large language models for tools (52 stars)
37. LORIS: Long-Term Rhythmic Video Soundtracker (ICML 2023) (Python, 47 stars)
38. ChartAst: ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning (Python, 46 stars)
39. Siamese-Image-Modeling: [CVPR 2023] Implementation of Siamese Image Modeling for Self-Supervised Vision Representation Learning (Python, 32 stars)
40. InternVL-MMDetSeg: Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed (Jupyter Notebook, 22 stars)
41. Multitask-Model-Selector: Implementation of "Foundation Model is Efficient Multimodal Multitask Model Selector" (Python, 22 stars)
42. Official-ConvMAE-Det (Python, 13 stars)
43. opengvlab.github.io (12 stars)
44. MovieMind (9 stars)
45. perception_test_iccv2023: Champion solutions repository for the Perception Test challenges at the ICCV 2023 workshop (Python, 9 stars)
46. EmbodiedGPT (5 stars)
47. DriveMLM (3 stars)
48. .github (2 stars)