• Stars: 122
  • Rank: 292,031 (Top 6%)
  • Language: Python
  • License: MIT License
  • Created: over 5 years ago
  • Updated: about 2 years ago

Repository Details

[ACL 2020] PyTorch code for TVQA+: Spatio-Temporal Grounding for Video Question Answering

TVQA+: Spatio-Temporal Grounding for Video Question Answering

(Figure: TVQA+ question answering example)

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos. We first augment the TVQA dataset with 310.8k bounding boxes, linking depicted objects to visual concepts in questions and answers. We name this augmented version TVQA+. We then propose Spatio-Temporal Answerer with Grounded Evidence (STAGE), a unified framework that grounds evidence in both the spatial and temporal domains to answer questions about videos. Comprehensive experiments and analyses demonstrate the effectiveness of our framework and how the rich annotations in our TVQA+ dataset can contribute to the question answering task. As a side benefit, performing this joint task also enables our model to produce more insightful intermediate results.
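
To make the annotation format concrete, below is a simplified, hypothetical sketch of a single TVQA+ entry in Python. The field names (q, answers, answer_idx, ts, bbox) are illustrative assumptions only; the released tvqa_plus_*.json files are the authoritative reference for the actual schema.

# Hypothetical, simplified TVQA+ annotation entry (field names are illustrative;
# see the released tvqa_plus_*.json files for the actual schema).
example_entry = {
    "qid": 12345,
    "q": "What is Sheldon holding when he talks to Leonard?",
    "answers": ["A laptop", "A comic book", "A mug", "A whiteboard", "A takeout box"],
    "answer_idx": 2,                  # index of the correct answer
    "ts": [12.3, 17.8],               # temporal span (in seconds) that grounds the answer
    "bbox": {                         # per-frame boxes linking words to visual concepts
        "00043": [{"label": "mug", "left": 210, "top": 95, "width": 60, "height": 80}],
    },
}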

In this repository, we provide a PyTorch implementation of the STAGE model, along with basic preprocessing and evaluation code for the TVQA+ dataset.

TVQA+: Spatio-Temporal Grounding for Video Question Answering
Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal. [PDF]

Resources

Model

  • STAGE Overview. Spatio-Temporal Answerer with Grounded Evidence (STAGE) is a unified framework that grounds evidence in both the spatial and temporal domains to answer questions about videos (model overview figure; a schematic sketch follows this list).

  • Prediction Examples (example predictions figure)
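
For a rough mental model before diving into the code, the sketch below is a heavily simplified, hypothetical PyTorch illustration of the general idea: question-guided attention over per-frame object regions (spatial grounding), frame-level start/end scores (temporal grounding), and an answer classifier. It is not the repository's STAGE implementation; all names and shapes are made up for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingSketch(nn.Module):
    # Schematic only: question-guided spatial attention over object regions,
    # then temporal scoring over frames, then answer classification.
    def __init__(self, d=128, num_answers=5):
        super(GroundingSketch, self).__init__()
        self.region_proj = nn.Linear(d, d)
        self.question_proj = nn.Linear(d, d)
        self.span_scorer = nn.Linear(d, 2)             # start/end score per frame
        self.answer_scorer = nn.Linear(d, num_answers)

    def forward(self, region_feats, question_feat):
        # region_feats: (num_frames, num_regions, d), question_feat: (d,)
        q = self.question_proj(question_feat)                      # (d,)
        r = self.region_proj(region_feats)                         # (F, R, d)
        spatial_att = F.softmax((r * q).sum(-1), dim=-1)           # (F, R) spatial grounding
        frame_feats = (spatial_att.unsqueeze(-1) * r).sum(1)       # (F, d)
        span_logits = self.span_scorer(frame_feats)                # (F, 2) temporal grounding
        answer_logits = self.answer_scorer(frame_feats.mean(0))    # (num_answers,)
        return spatial_att, span_logits, answer_logits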

Requirements

  • Python 2.7
  • PyTorch 1.1.0 (should work for 0.4.0 - 1.2.0)
  • tensorboardX
  • tqdm
  • h5py
  • numpy
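
As a quick sanity check that these dependencies are importable, a minimal sketch is below; the version printout is only indicative, since the code reportedly works with PyTorch 0.4.0 - 1.2.0.

# Minimal environment sanity check for the dependencies listed above.
import h5py
import numpy
import torch
import tensorboardX
import tqdm

print("PyTorch:", torch.__version__)        # expect something in the 0.4.0 - 1.2.0 range
print("CUDA available:", torch.cuda.is_available())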

Training and Evaluation

1. Download and uncompress the preprocessed features from Google Drive.

# Uncompress the file into the project root directory; you should get a directory
# `tvqa_plus_stage_features` containing all the required feature files.
cd $PROJECT_ROOT; tar -xf tvqa_plus_stage_features_new.tar.gz

gdrive is a handy tool for downloading the file. Note that the features have changed; you have to re-download them if you have our previous version.

2. Run in debug mode to test your environment and path settings:

bash run_main.sh debug

3. Train the full STAGE model:

bash run_main.sh --add_local

Note that you will need around 30 GB of memory to load the data. Otherwise, you can additionally add the --no_core_driver flag to stop loading all the features into memory. After training, you should be able to get ~72.00% QA accuracy, which is comparable to the reported number. The trained model and config file are stored at ${PROJECT_ROOT}/results/${MODEL_DIR}.
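
If you want to inspect the saved model outside the provided scripts, something along the following lines should work. The checkpoint file name used below is purely hypothetical; substitute whatever file the training run actually writes into your ${MODEL_DIR}.

import os
import torch

model_dir = os.path.join("results", "MODEL_DIR")        # replace with your actual run directory
ckpt_path = os.path.join(model_dir, "best_valid.pth")   # hypothetical name; check your MODEL_DIR

# Load on CPU so this works without a GPU.
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(type(checkpoint))   # typically a state_dict or a dict containing one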

4. Inference

bash run_inference.sh --model_dir ${MODEL_DIR} --mode ${MODE}

${MODE} can be valid or test. After inference, you will get a ${MODE}_inference_predictions.json file in ${MODEL_DIR}, which is similar to the sample prediction file at eval/data/val_sample_prediction.json.
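
To take a quick look at the prediction file before running evaluation, a minimal sketch follows; the per-entry fields follow the sample file eval/data/val_sample_prediction.json, which is the authoritative reference.

import json

# Path assumes a run directory named MODEL_DIR and MODE=valid; adjust to your setup.
with open("results/MODEL_DIR/valid_inference_predictions.json") as f:
    predictions = json.load(f)

# Print the top-level structure and one entry to see the schema.
print(type(predictions), len(predictions))
first = predictions[0] if isinstance(predictions, list) else next(iter(predictions.items()))
print(first)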

5. Evaluation

cd eval; python eval_tvqa_plus.py --pred_path ../results/${MODEL_DIR}/valid_inference_predictions.json --gt_path data/tvqa_plus_val.json

Note that you can only evaluate val predictions here. To evaluate the test set, please follow the instructions here.
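
The official scorer is eval/eval_tvqa_plus.py; purely as an illustration of how the QA-accuracy part of such an evaluation could be computed, here is a hedged sketch. The field names and the dict/list structures assumed below are not taken from the repository; the sample prediction file and data/tvqa_plus_val.json define the real format.

import json

# Illustrative only: compares predicted answer indices against ground truth.
# Field names ("answer", "answer_idx") and keying by qid are assumptions; the real
# format is defined by eval/data/val_sample_prediction.json and data/tvqa_plus_val.json.
def qa_accuracy(pred_path, gt_path):
    with open(pred_path) as f:
        preds = json.load(f)      # assumed: dict mapping qid -> prediction entry
    with open(gt_path) as f:
        gts = json.load(f)        # assumed: list of ground-truth annotation entries
    gt_by_qid = {str(item["qid"]): item["answer_idx"] for item in gts}
    correct = sum(1 for qid, p in preds.items()
                  if str(qid) in gt_by_qid and p["answer"] == gt_by_qid[str(qid)])
    return float(correct) / max(len(gt_by_qid), 1)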

Citation

@inproceedings{lei2019tvqa,
  title={TVQA+: Spatio-Temporal Grounding for Video Question Answering},
  author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},
  booktitle={Tech Report, arXiv},
  year={2019}
}

TODO

  1. Add data preprocessing scripts (preprocessed features are provided)
  2. Add model and training scripts
  3. Add inference and evaluation scripts

Contact

  • Dataset: faq-tvqa-unc [at] googlegroups.com
  • Model: Jie Lei, jielei [at] cs.unc.edu

More Repositories

1. animeGAN: A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing. (Jupyter Notebook, 1,277 stars)

2. ClipBERT: [CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. (Python, 700 stars)
3. scipy-lecture-notes-zh-CN: Chinese translation of scipy-lecture-notes. The website is offline; updates continue as offline HTML, see the releases. (Python, 409 stars)
4. moment_detr: [NeurIPS 2021] Moment-DETR code and QVHighlights dataset. (Python, 254 stars)

5. TVQA: [EMNLP 2018] PyTorch code for TVQA: Localized, Compositional Video Question Answering. (Python, 168 stars)

6. recurrent-transformer: [ACL 2020] PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. (Jupyter Notebook, 167 stars)

7. TVRetrieval: [ECCV 2020] PyTorch code for XML on the TVRetrieval dataset - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. (Python, 151 stars)

8. singularity: [ACL 2023] Official PyTorch code for the Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning". (Python, 127 stars)

9. TVCaption: [ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on the TVCaption dataset. (Python, 86 stars)

10. VideoLanguageFuturePred: [EMNLP 2020] What is More Likely to Happen Next? Video-and-Language Future Event Prediction. (Python, 47 stars)

11. mTVRetrieval: [ACL 2021] mTVR: Multilingual Video Moment Retrieval. (Python, 26 stars)

12. classification-with-coarse-fine-labels: Code accompanying the paper Weakly Supervised Image Classification with Coarse and Fine Labels. (Lua, 8 stars)

13. my-scripts: Collections of useful scripts for my daily usage. (Python, 1 star)

14. pytorch-pretrained-BERT: A copy from https://github.com/huggingface/pytorch-pretrained-BERT. (Jupyter Notebook, 1 star)