• Stars
    star
    254
  • Rank 160,264 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created over 3 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

[NeurIPS 2021] Moment-DETR code and QVHighlights dataset

Moment-DETR

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries, NeurIPS 2021

Jie Lei, Tamara L. Berg, Mohit Bansal

This repo contains a copy of QVHighlights dataset for moment retrieval and highlight detections. For details, please check data/README.md This repo also hosts the Moment-DETR model (see overview below), a new model that predicts moment coordinates and saliency scores end-to-end based on a given text query. This released code supports pre-training, fine-tuning, and evaluation of Moment-DETR on the QVHighlights datasets. It also supports running prediction on your own raw videos and text queries.

model

Table of Contents

Getting Started

Prerequisites

  1. Clone this repo
git clone https://github.com/jayleicn/moment_detr.git
cd moment_detr
  1. Prepare feature files

Download moment_detr_features.tar.gz (8GB), extract it under project root directory:

tar -xf path/to/moment_detr_features.tar.gz

The features are extracted using Linjie's HERO_Video_Feature_Extractor. If you want to use your own choices of video features, please download the raw videos from this link.

  1. Install dependencies.

This code requires Python 3.7, PyTorch, and a few other Python libraries. We recommend creating conda environment and installing all the dependencies as follows:

# create conda env
conda create --name moment_detr python=3.7
# activate env
conda actiavte moment_detr
# install pytorch with CUDA 11.0
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
# install other python packages
pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas

The PyTorch version we tested is 1.9.0.

Training

Training can be launched by running the following command:

bash moment_detr/scripts/train.sh 

This will train Moment-DETR for 200 epochs on the QVHighlights train split, with SlowFast and Open AI CLIP features. The training is very fast, it can be done within 4 hours using a single RTX 2080Ti GPU. The checkpoints and other experiment log files will be written into results. For training under different settings, you can append additional command line flags to the command above. For example, if you want to train the model without the saliency loss (by setting the corresponding loss weight to 0):

bash moment_detr/scripts/train.sh --lw_saliency 0

For more configurable options, please checkout our config file moment_detr/config.py.

Inference

Once the model is trained, you can use the following command for inference:

bash moment_detr/scripts/inference.sh CHECKPOINT_PATH SPLIT_NAME  

where CHECKPOINT_PATH is the path to the saved checkpoint, SPLIT_NAME is the split name for inference, can be one of val and test.

Pretraining and Finetuning

Moment-DETR utilizes ASR captions for weakly supervised pretraining. To launch pretraining, run:

bash moment_detr/scripts/pretrain.sh 

This will pretrain the Moment-DETR model on the ASR captions for 100 epochs, the pretrained checkpoints and other experiment log files will be written into results. With the pretrained checkpoint, we can launch finetuning from a pretrained checkpoint PRETRAIN_CHECKPOINT_PATH as:

bash moment_detr/scripts/train.sh  --resume ${PRETRAIN_CHECKPOINT_PATH}

Note that this finetuning process is the same as standard training except that it initializes weights from a pretrained checkpoint.

Evaluation and Codalab Submission

Please check standalone_eval/README.md for details.

Train Moment-DETR on your own dataset

To train Moment-DETR on your own dataset, please prepare your dataset annotations following the format of QVHighlights annotations in data, and extract features using HERO_Video_Feature_Extractor. Next copy the script moment_detr/scripts/train.sh and modify the dataset specific parameters such as annotation and feature paths. Now you are ready to use this script for training as described in Training.

Run predictions on your own videos and queries

You may also want to run Moment-DETR model on your own videos and queries. First you need to add a few libraries for feature extraction to your environment. Before this, you should have already installed PyTorch and other libraries for running Moment-DETR following instuctions in previous sections.

pip install ffmpeg-python ftfy regex

Next, run the example provided in this repo:

PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py

This will load the Moment-DETR model checkpoint trained with CLIP image and text features, and make predictions for the video RoripwjYFp8_60.0_210.0.mp4 with its associated query in run_on_video/example/queries.jsonl. The output will look like the following:

Build models...
Loading feature extractors...
Loading CLIP models
Loading trained Moment-DETR model...
Run prediction...
------------------------------idx0
>> query: Chef makes pizza and cuts it up.
>> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
>> GT moments: [[106, 122]]
>> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
    [49.967, 64.9129, 0.9421], 
    [66.4396, 81.0731, 0.9271], 
    [105.9434, 122.0372, 0.9234], 
    [93.2057, 103.3713, 0.2222], 
    ..., 
    [45.3834, 52.2183, 0.0005]
   ]
>> GT saliency scores (only localized 2-sec clips): 
    [[2, 3, 3], [2, 3, 3], ...]
>> Predicted saliency scores (for all 2-sec clip): 
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]   

You can see the 3rd ranked moment [105.9434, 122.0372] matches quite well with the ground truth of [106, 122], with a confidence score of 0.9234. You may want to refer to data/README.md for more info about how the ground-truth is organized. Your predictions might slightly differ from the predictions here, depends on your environment.

To run predictions on your own videos and queries, please take a look at the run_example function inside the run_on_video/run.py file.

Acknowledgement

We thank Linjie Li for the helpful discussions. This code is based on detr and TVRetrieval XML. We used resources from mdetr, MMAction2, CLIP, SlowFast and HERO_Video_Feature_Extractor. We thank the authors for their awesome open-source contributions.

LICENSE

The annotation files are under CC BY-NC-SA 4.0 license, see ./data/LICENSE. All the code are under MIT license, see LICENSE.

More Repositories

1

animeGAN

A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing.
Jupyter Notebook
1,277
star
2

ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
Python
700
star
3

scipy-lecture-notes-zh-CN

中文版scipy-lecture-notes. 网站下线, 以离线HTML的形式继续更新, 见release.
Python
409
star
4

TVQA

[EMNLP 2018] PyTorch code for TVQA: Localized, Compositional Video Question Answering
Python
168
star
5

recurrent-transformer

[ACL 2020] PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Jupyter Notebook
167
star
6

TVRetrieval

[ECCV 2020] PyTorch code for XML on TVRetrieval dataset - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Python
151
star
7

singularity

[ACL 2023] Official PyTorch code for Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning"
Python
127
star
8

TVQAplus

[ACL 2020] PyTorch code for TVQA+: Spatio-Temporal Grounding for Video Question Answering
Python
122
star
9

TVCaption

[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset
Python
86
star
10

VideoLanguageFuturePred

[EMNLP 2020] What is More Likely to Happen Next? Video-and-Language Future Event Prediction
Python
47
star
11

mTVRetrieval

[ACL 2021] mTVR: Multilingual Video Moment Retrieval
Python
26
star
12

classification-with-coarse-fine-labels

Code accompanying the paper Weakly Supervised Image Classification with Coarse and Fine Labels.
Lua
8
star
13

my-scripts

Collections of useful scripts for my daily usage
Python
1
star
14

pytorch-pretrained-BERT

A copy from https://github.com/huggingface/pytorch-pretrained-BERT
Jupyter Notebook
1
star