[NeurIPS2022] Egocentric Video-Language Pretraining

EgoVLP: Egocentric Video-Language Pretraining

Project page | arXiv

TL;DR: We pioneer Egocentric Video-Language Pretraining across the pretraining dataset, the model, and the development benchmark; the resulting pretrained model exhibits strong performance on five downstream tasks across three egocentric datasets.

EgoVLP

📢 News

๐Ÿ“ Preparation

Install dependencies

conda env create -f environment.yml
source activate egovlp

Ego4D videos and metadata

You can skip the source video download if pretraining is not required.

  1. Follow the guidelines here and download the following to {PATH_TO_EGO4D}:

    • Ego4D source videos (nearly 7 TB).
    • Ego4D video metadata manifest.csv and benchmark metadata, e.g., nlq_train.json for NLQ.
    • Create the directory dataset and add a soft link via ln -s {PATH_TO_EGO4D} dataset/ego4d.
  2. For efficient pretraining, we compress the videos as follows (see the sketch after this list):

    • Resize the source videos so that the short side equals 256, using the script utils/video_resize.py.
    • Chunk the resized videos into multiple segments (up to 600 s each) using the script utils/video_chunk.py.
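
For reference, here is a minimal sketch of this compression step using ffmpeg through Python's subprocess. It is not the repository's utils/video_resize.py or utils/video_chunk.py, and the output paths are placeholders.

```python
import subprocess
from pathlib import Path

def resize_short_side(src: str, dst: str, short: int = 256) -> None:
    """Rescale a video so its shorter side equals `short`, keeping the aspect ratio."""
    scale = f"scale='if(gt(iw,ih),-2,{short})':'if(gt(iw,ih),{short},-2)'"
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", scale, dst], check=True)

def chunk_video(src: str, out_dir: str, seg_sec: int = 600) -> None:
    """Split a video into segments of at most `seg_sec` seconds without re-encoding."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "chunk_%03d.mp4")
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c", "copy", "-f", "segment",
                    "-segment_time", str(seg_sec), "-reset_timestamps", "1",
                    pattern], check=True)

# Example (paths are placeholders):
# resize_short_side("dataset/ego4d/full_scale/VIDEO_UID.mp4", "resized/VIDEO_UID.mp4")
# chunk_video("resized/VIDEO_UID.mp4", "chunked/VIDEO_UID")
```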

EgoClip: an egocentric video-language pretraining dataset

  • Download the EgoClip metadata from here and put it at dataset/egoclip.csv.

  • For the usage of EgoClip, please see our dataloader data_loader/EgoClip_EgoMCQ_dataset.py. The data format of EgoClip is:

    import pandas as pd

    # Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there.
    metadata = pd.read_csv('dataset/egoclip.csv', sep='\t', error_bad_lines=False)
    print(metadata.shape[0])
    print(metadata.iloc[0])
    
    # Out:
    3847723                                                         # Num of clips for EgoClip
    
    clip_idx                                                     0  # the idx of clip
    video_uid                 001e3e4e-2743-47fc-8564-d5efd11f9e90  # the uid of source video
    video_dur                                           128.033333  # the duration of source video
    narration_source                              narration_pass_1  # the source of annotator
    narration_ind                                                0  # the idx of narration
    narration_time                                          3.3445  # the narration timestamp
    clip_start                                            2.967651  # the start timestamp of clip
    clip_end                                              3.721266  # the end timestamp of clip
    clip_text           #C C picks a bag of clothes from the floor  # the narration of clip
    tag_verb                                                  [93]  # the verb idx of the narration
    tag_noun                                        [192, 115, 12]  # the noun idx of the narration

^ The fields tag_verb and tag_noun are used for the EgoNCE pretraining objective, which takes synonyms into account. For example, pick, collect, and gather all belong to the verb parent with idx 93: take_(pick,_grab,_get). The mapping dictionary can be found here.
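
For intuition, here is a minimal sketch (not the repository's implementation) of how these verb/noun tags can define action-aware positives within a batch, following the EgoNCE idea that two clips sharing at least one verb class and one noun class are treated as positives. Parsing the tag strings with ast.literal_eval is an assumption about the CSV format shown above.

```python
import ast
import torch

def egonce_positive_mask(tag_verbs, tag_nouns):
    """Build a boolean [B, B] mask where entry (i, j) is True if clips i and j
    share at least one verb class AND at least one noun class (EgoNCE-style
    action-aware positives). Tags are the stringified lists from egoclip.csv."""
    verbs = [set(ast.literal_eval(v)) for v in tag_verbs]   # e.g. "[93]" -> {93}
    nouns = [set(ast.literal_eval(n)) for n in tag_nouns]   # e.g. "[192, 115, 12]"
    B = len(verbs)
    mask = torch.eye(B, dtype=torch.bool)                   # every clip is its own positive
    for i in range(B):
        for j in range(i + 1, B):
            if verbs[i] & verbs[j] and nouns[i] & nouns[j]:
                mask[i, j] = mask[j, i] = True
    return mask

# Toy example with three clips: the first two share verb 93 and noun 192.
mask = egonce_positive_mask(["[93]", "[93, 5]", "[10]"],
                            ["[192, 115]", "[192]", "[7]"])
print(mask)
```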

EgoMCQ: an egocentric video-language development set

  • Download the EgoMCQ metadata from here and put it at dataset/egomcq.json.
  • EgoMCQ is a benchmark for video-language multiple-choice questions: given a text query, the model must choose the correct video clip from five candidates sampled under two settings, inter-video or intra-video (see the sketch below).
  • For the usage of EgoMCQ, please see our dataloader data_loader/EgoClip_EgoMCQ_dataset.py.
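
As a rough illustration of the multiple-choice protocol (not the repository's evaluation code; the tensor shapes and the answer-index convention are illustrative assumptions), accuracy is simply the fraction of queries whose highest-similarity candidate is the ground-truth clip:

```python
import torch
import torch.nn.functional as F

def mcq_accuracy(text_emb, cand_emb, answer_idx):
    """text_emb: [N, D] query embeddings; cand_emb: [N, 5, D] candidate clip
    embeddings; answer_idx: [N] index of the correct clip for each query."""
    text_emb = F.normalize(text_emb, dim=-1)
    cand_emb = F.normalize(cand_emb, dim=-1)
    sim = torch.einsum('nd,nkd->nk', text_emb, cand_emb)  # cosine similarities
    pred = sim.argmax(dim=-1)
    return (pred == answer_idx).float().mean().item()

# Toy example with random embeddings (chance level is about 0.2 for 5 candidates).
acc = mcq_accuracy(torch.randn(8, 256), torch.randn(8, 5, 256),
                   torch.randint(0, 5, (8,)))
print(acc)
```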

EgoMCQ

๐Ÿ‹๏ธโ€๏ธ Pretraining

This code is built on PyTorch with DistributedDataParallel (DDP). We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs (10 epochs in about two days).

  • Train on EgoClip: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_egoclip.py --config configs/pt/egoclip.json

  • Test on EgoMCQ: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_egoclip.py --config configs/eval/egomcq.json

  • Monitor the EgoMCQ curve during pretraining: tensorboard --logdir results --bind_all

🗄 Pretrained Weights

  • We have released our pretrained EgoVLP model (EgoClip w/ EgoNCE) with the best performance on EgoMCQ (90.7% inter-video & 57.2% intra-video) as EgoVLP_PT_BEST.
  • Please download the checkpoint and put it under pretrained/ (a loading sketch follows the notes below).

^ This checkpoint is used for the EPIC-Kitchens, NLQ, MQ, OSCC, and PNR tasks, but not for Charades-Ego. Since we found that VLP (CC3M+WebVid2M, EgoClip) always degrades significantly on Charades-Ego after the first epoch, we evaluate Charades-Ego using the weights from the first pretraining epoch of EgoVLP, released as EgoVLP_PT_EPO1.

^^ You can use our checkpoint to power other egocentric video benchmarks. :)
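
If you want to reuse the checkpoint outside the provided configs, a minimal loading sketch looks like the following. The file name and the 'state_dict' key are assumptions about a standard PyTorch checkpoint, and `model` stands for a video-language model built from this repository's configs.

```python
import torch

# Assumption: the released checkpoint is a regular PyTorch .pth file whose
# weights live under a 'state_dict' key; the file name below is a placeholder.
ckpt = torch.load('pretrained/egovlp_pt_best.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

# DDP training typically prefixes parameter names with 'module.'; strip it
# before loading into a single-GPU model.
state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}
# model.load_state_dict(state_dict, strict=False)
```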

🔧 Downstream Tasks

EPIC-Kitchens MIR

  • Preparation:
  1. Follow the instructions here and download the EPIC-Kitchens dataset (RGB frames) and annotations to dataset/epic-kitchens/.
  2. Follow the instructions here -> "How do I create the relevance matrix?" to construct the relevance matrix for evaluation.
  • Results:
| Model | Mode | # Frames | Video-Text PT | Weights | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| EgoVLP | Zero-shot | 4 | EgoClip w/ EgoNCE | EgoVLP_PT_BEST | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1 |
| EgoVLP | Fine-tuning w/ MI-MM | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4 |
| EgoVLP+ | Fine-tuning w/ Adaptive-MI-MM + Dual-softmax | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC+ | 53.8 | 40.9 | 47.4 | 63.3 | 59.6 | 61.4 |

^ EgoVLP+ denotes our submission to the Multi-Instance Retrieval@EPIC-Kitchens Challenge 2022, which is equipped with the Adaptive MI-MM loss and Dual-softmax for prediction.
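
As background on the Dual-softmax step, one common inference-time formulation (not necessarily the exact variant used for EgoVLP+) re-weights the text-to-video similarity matrix by a softmax taken over the other axis before ranking:

```python
import torch
import torch.nn.functional as F

def dual_softmax_rerank(sim, temperature=100.0):
    """Re-weight a text-video similarity matrix [num_texts, num_videos] by the
    softmax over texts for each video, so a video that matches many texts
    indiscriminately is down-weighted, then rank videos per text query."""
    prior = F.softmax(sim * temperature, dim=0)  # how exclusively each video matches a text
    return sim * prior

sim = torch.randn(4, 6)  # toy similarity matrix
ranked = dual_softmax_rerank(sim).argsort(dim=1, descending=True)
print(ranked)
```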

  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_epic.py --config configs/ft/epic.json

  • Test: python3 run/test_epic.py

Charades-Ego

  • Preparation:
  1. Follow the instructions here and download the Charades-Ego dataset (480p) and annotations to dataset/charades/.
  2. Create the training metadata via utils/charades_meta.py.
  • Results:
| Model | Mode | # Frames | Video-Text PT | Weights | mAP |
|---|---|---|---|---|---|
| EgoVLP | Zero-shot | 16 | EgoClip w/ EgoNCE | EgoVLP_PT_EPO1 | 25.0 |
| EgoVLP | Fine-tuning w/ InfoNCE | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_CHARADES | 32.1 |
  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_charades.py --config configs/ft/charades.json

  • Test: python3 run/test_charades.py
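
For reference, video-level mAP over Charades-Ego's 157 action classes can be computed as below. This is a generic sketch using scikit-learn, not the repository's evaluation script, and the score/label shapes are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def charades_map(scores, labels):
    """scores: [num_videos, 157] predicted class scores; labels: same-shape
    binary ground truth. Classes with no positive video are skipped."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(labels.shape[1]) if labels[:, c].any()]
    return float(np.mean(aps))

# Toy example with random scores and sparse random labels.
scores = np.random.rand(10, 157)
labels = (np.random.rand(10, 157) > 0.9).astype(int)
print(charades_map(scores, labels))
```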

NLQ @ Ego4D

  • Preparation:
  1. Make sure you have prepared the NLQ metadata.
  2. For the video branch, download the EgoVLP clip-level features for NLQ. ^ We extract these dense video features (fps=1.87) with the script run/test_nlq.py.
  3. For the text branch, you can extract EgoVLP text features via python3 run/test_nlq.py --subsample 'text', or use our pretrained text encoder.
  4. Fine-tune VSLNet or other methods by replacing their input video-text features (see the sketch after this list).
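
To give an idea of how the pre-extracted features line up with the NLQ annotations, here is a sketch only: the per-clip .pt layout and the feature shape are assumptions, while fps=1.87 is the rate stated above.

```python
import torch

FEATURE_FPS = 1.87  # one EgoVLP video feature every 1/1.87 s, as stated above

def load_clip_features(path):
    """Load pre-extracted EgoVLP features for one clip; assumed shape [T, D]."""
    return torch.load(path, map_location='cpu')

def second_to_feature_index(t_sec, num_features):
    """Map an NLQ timestamp (seconds) to the nearest feature index."""
    return min(int(round(t_sec * FEATURE_FPS)), num_features - 1)

# feats = load_clip_features('nlq_features/CLIP_UID.pt')       # path is a placeholder
# start_idx = second_to_feature_index(12.4, feats.shape[0])
```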

^ We provide our VSLNet codebase, which adapts the EgoVLP features, as an example; you can refer to the data loader and text encoder.

^ Our EgoVLP brings consistent improvement over multiple NLQ challenge baselines.

| Model | Video-Text Pre-extracted Features | R@1, IoU=0.3 | R@5, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.5 |
|---|---|---|---|---|---|
| VSLNet | SlowFast + BERT | 5.45 | 10.74 | 3.12 | 6.63 |
| VSLNet | EgoVLP | 10.84 | 18.84 | 6.81 | 13.45 |
| CONE | SlowFast + BERT | 10.40 | 22.74 | 5.03 | 11.87 |
| CONE | EgoVLP | 14.15 | 30.33 | 8.18 | 18.02 |

MQ @ Ego4D

  • Preparation:
  1. Make sure you have prepared the MQ metadata.
  2. Download the EgoVLP clip-level features for MQ. ^ We extract these dense video features (fps=1.87) with the script run/test_mq.py.
  3. Fine-tune VSGN or other methods by replacing their input video features.

^ We provide our VSGN codebase, which adapts the EgoVLP features, as an example; you can refer to the data loader.

^ Our EgoVLP brings consistent improvement over multiple MQ challenge baselines.

| Model | Video Pre-extracted Features | R@1, IoU=0.5 | R@5, IoU=0.5 | mAP |
|---|---|---|---|---|
| VSGN | SlowFast | 25.16 | 46.18 | 6.03 |
| VSGN | EgoVLP | 30.14 | 51.98 | 11.39 |
| ActionFormer | SlowFast + Omnivore | 33.46 | - | 17.17 |
| ActionFormer | SlowFast + Omnivore + EgoVLP | 36.84 | - | 20.90 |

OSCC @ Ego4D

  • Preparation:
  1. Make sure you have prepared the OSCC videos and metadata.
  2. Extract the clip frames following the instructions here -> Data Preparation.
  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_oscc.py --config configs/ft/oscc.json
| Model | Video-Text Pretrained | OSCC Acc (%) |
|---|---|---|
| TimeSformer | ImageNet Init. | 70.3 |
| TimeSformer | EgoVLP | 73.9 |

PNR @ Ego4D

  • Preparation: Same as OSCC.
  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_pnr.py --config configs/ft/pnr.json
| Model | Video-Text Pretrained | PNR Err (s) |
|---|---|---|
| TimeSformer | ImageNet Init. | 0.616 |
| TimeSformer | EgoVLP | 0.622 |

^ We found that the effect of VLP is minor on the PNR task.

🎓 Citation

If you find our work helpful, please cite our paper.

@article{kevin2022egovlp,
  title={Egocentric Video-Language Pretraining},
  author={Lin, Kevin Qinghong and Wang, Alex Jinpeng and Soldan, Mattia and Wray, Michael and Yan, Rui and Xu, Eric Zhongcong and Gao, Difei and Tu, Rongcheng and Zhao, Wenzhe and Kong, Weijie and others},
  journal={arXiv preprint arXiv:2206.01670},
  year={2022}
}

✉️ Contact

This repo is maintained by Kevin. Questions and discussions are welcome via [email protected].

We are willing to merge results and code if you transfer our EgoVLP to other egocentric tasks or datasets.

🙏 Acknowledgements

This codebase is based on Frozen.

Thanks to Alex for the help with DDP and Mattia for the help with NLQ and MQ benchmarks.

LICENSE

MIT
