  • Stars: 143
  • Rank: 257,007 (Top 6 %)
  • Language: Python
  • Created: over 4 years ago
  • Updated: over 1 year ago

Repository Details

PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)

MDVC

Multi-modal Dense Video Captioning

Project Page | Proceedings | ArXiv | Presentation (Mirror)

This is a PyTorch implementation of our paper Multi-modal Dense Video Captioning (CVPR Workshops 2020).

The publication appeared in the conference proceedings of CVPR Workshops. Please use this BibTeX citation:

@InProceedings{MDVC_Iashin_2020,
  author = {Iashin, Vladimir and Rahtu, Esa},
  title = {Multi-Modal Dense Video Captioning},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  pages = {958--959},
  year = {2020}
}

If you found this work interesting, check out our latest paper, where we propose a novel architecture for the dense video captioning task called Bi-modal Transformer with Proposal Generator.

Usage

The code was tested on Ubuntu 16.04/18.04 with a single NVIDIA 1080 Ti/2080 Ti GPU. If you plan to use it with other software or hardware, you might need to adapt the conda environment files or even the code.

Clone this repository. Mind the --recursive flag: it makes sure the submodules (the Python 3 evaluation scripts) are also cloned.

git clone --recursive https://github.com/v-iashin/MDVC.git

Download the I3D (17 GB) and VGGish (1 GB) features and put them in the ./data/ folder (the speech segments are already there). You may use curl -O <link> to download the features.

# MD5 Hash
a661cfe3535c0d832ec35dd35a4fdc42  sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5
54398be59d45b27397a60f186ec25624  sub_activitynet_v1-3.vggish.hdf5
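
After downloading, you can check the files against the hashes above, for example (assuming the files were saved to ./data/):

# verify the downloaded features from the repository root
md5sum ./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5
md5sum ./data/sub_activitynet_v1-3.vggish.hdf5
# the printed hashes should match the ones listed above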

Set up the conda environment. The requirements are in the file conda_env.yml.

# it will create new conda environment called 'mdvc' on your machine
conda env create -f conda_env.yml
conda activate mdvc
# install spacy language model. Make sure you activated the conda environment
python -m spacy download en
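
As a quick sanity check before training (assuming the environment pins a CUDA-enabled PyTorch build), you can verify that PyTorch sees your GPU:

# run inside the activated 'mdvc' environment
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"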

Train and Predict

Run the training and prediction script. It will first train the captioning model and then evaluate the predictions of the best model in the learned proposal setting. It takes ~24 hours (50 epochs) to run on a 2080 Ti GPU. Please note that the performance is expected to reach its peak after ~30 epochs.

# make sure to activate environment: conda activate mdvc
# the cuda:1 device will be used for the run
python main.py --device_ids 1

The script keeps the log files, including the TensorBoard log, under the ./log directory by default. You may specify another path using the --log_dir argument. Also, if you stored the downloaded data (.hdf5) files in a directory other than ./data, make sure to specify it using the --video_features_path and --audio_features_path arguments.
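
For example, a run with custom locations might look like this (the paths below are illustrative, not shipped defaults):

# illustrative paths; point them to wherever you stored the .hdf5 files and logs
python main.py --device_ids 1 \
    --video_features_path /my/data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5 \
    --audio_features_path /my/data/sub_activitynet_v1-3.vggish.hdf5 \
    --log_dir ./my_logs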

You may also download the pre-trained model here (~2 GB).

# MD5 Hash
55cda5bac1cf2b7a803da24fca60898b  best_model.pt

Evaluation Scripts and Results

If you want to skip the training procedure, you may replicate the main results of the paper using the prediction files in ./results and the official evaluation script.

  1. To evaluate the performance in the learned proposal setup, run the official evaluation script on ./results/results_val_learned_proposals_e30.json. Our final result is 6.8009.
  2. To evaluate the performance on ground truth segments, run the script on each validation part (./results/results_val_*_e30.json) against the corresponding ground truth file (use the -r argument in the script to specify each of them). When both values are obtained, average them to verify the final result. We got 9.9407 and 10.2478 on the val_1 and val_2 parts, respectively; thus, the average is 10.094. A sketch of these invocations is shown after this list.
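
A minimal sketch of the two steps above. The script name and the -s submission flag are assumptions about the official evaluation script shipped as a submodule; the -r flag and the result paths come from the steps above, and the reference filenames are placeholders for the filtered ground truth files provided in ./data:

# 1) learned proposals
python evaluate.py -s ./results/results_val_learned_proposals_e30.json
# 2) ground truth segments; replace the -r placeholders with the filtered reference files in ./data
python evaluate.py -s ./results/results_val_1_e30.json -r ./data/<filtered_val_1_reference>.json
python evaluate.py -s ./results/results_val_2_e30.json -r ./data/<filtered_val_2_reference>.json
# average the two scores: (9.9407 + 10.2478) / 2 ≈ 10.094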

As we mentioned in the paper, we didn't have access to the full dataset because ActivityNet Captions is distributed as a list of links to YouTube videos. Consequently, many videos (~8.8 %) were no longer available at the time we downloaded the dataset. In addition, some videos didn't have any speech. We filtered such videos out of the validation files and reported the results as "no missings" in the paper. We provide these filtered ground truth files in ./data.

Raw Data & Details on Feature Extraction

If you are feeling brave, you may want to extract the features on your own. Check out our script for extracting the I3D and VGGish features from a set of videos: video_features on GitHub (make sure to check out the 6190f3d7db6612771b910cf64e274aedba8f1e1b commit). Also see #7 for more details on configuration. We also provide the script used to process the timestamps: ./utils/parse_subs.py.
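
For example, pinning the feature extraction code to that commit might look like this (the repository URL is inferred from the project name and may need adjusting):

# clone the feature extraction repository and pin it to the commit used in this work
git clone https://github.com/v-iashin/video_features.git
cd video_features
git checkout 6190f3d7db6612771b910cf64e274aedba8f1e1b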

Misc.

We additionally provide (a loading sketch follows the list):

  • the file with subtitles and their original timestamps in ./data/asr_en.csv
  • the file with video categories in ./data/vid2cat.json
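
A minimal sketch for inspecting both files, assuming pandas is available in the environment, a comma-separated CSV, and a flat JSON mapping (the actual column names and structure may differ):

# minimal inspection sketch; column names and JSON structure are assumptions
import json
import pandas as pd

subs = pd.read_csv('./data/asr_en.csv')   # subtitles with their original timestamps
print(subs.head())

with open('./data/vid2cat.json') as f:
    vid2cat = json.load(f)                # expected to map video ids to categories
print(len(vid2cat))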

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.

Media Coverage

More Repositories

  1. video_features (Python, 513 stars): Extract video features from raw videos using multiple GPUs. We support RAFT flow frames as well as S3D, I3D, R(2+1)D, VGGish, CLIP, and TIMM models.
  2. SpecVQGAN (Jupyter Notebook, 345 stars): Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021).
  3. BMT (Jupyter Notebook, 225 stars): Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020).
  4. CS231n (Jupyter Notebook, 51 stars): PyTorch/TensorFlow solutions for Stanford's CS231n: "CNNs for Visual Recognition".
  5. SparseSync (Python, 50 stars): Source code for "Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors" (Spotlight at the BMVC 2022).
  6. WebsiteYOLO (Python, 44 stars): The back-end for the YOLOv3 object detector running as a web app.
  7. Synchformer (Python, 26 stars): Efficient synchronization from sparse cues.
  8. VoxCeleb (Jupyter Notebook, 12 stars): An attempt to replicate the results of [1706.08612] VoxCeleb: a large-scale speaker identification dataset.
  9. CORSMAL (Jupyter Notebook, 8 stars): 🏆 Top-1 submission to the CORSMAL Challenge 2020 (at ICPR); the winning solution for the CORSMAL Challenge at the Intelligent Sensing Summer School 2020.
  10. JumpMethod (Jupyter Notebook, 6 stars): Selecting a proper number of clusters using the jump method.
  11. v-iashin.github.io (HTML, 6 stars): Personal webpage.
  12. FoursquareAPI (Jupyter Notebook, 6 stars): A simple example of using the Foursquare API in Python to get data about a venue (tips, user IDs, and other info) and about a user (home city, gender, number of friends, lists, check-ins, photos, and tips).
  13. CrossEntropyTSP (Jupyter Notebook, 5 stars): An approximate solution to the Traveling Salesman Problem using the cross-entropy approach in Python 3.
  14. LearnablePINs (Jupyter Notebook, 4 stars): An attempt to replicate the results of [1805.00833] Learnable PINs: Cross-Modal Embeddings for Person Identity.
  15. TuniSurvivalKit (2 stars): The ultimate survival kit for Ph.D. students at Tampere University.
  16. EM (Python, 1 star): EM algorithm for two 1-dimensional Gaussians in vanilla Python.
  17. CopulaDensityEstimator (TeX, 1 star): Recursive non-parametric estimation of the copula density.
  18. SamplingChessboardWithQueens (R, 1 star): Simulation of an N-by-M chessboard with K queens such that no queen attacks another, using simulated annealing.