TVSum Dataset
Title-based Video Summarization (TVSum) dataset used in our CVPR 2015 paper "TVSum: Summarizing web videos using titles."
Overview
The Title-based Video Summarization (TVSum) dataset serves as a benchmark for validating video summarization techniques. It contains 50 videos of various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 shot-level importance score annotations obtained via crowdsourcing (20 per video). The video and annotation data permit automatic evaluation of video summarization techniques, without having to conduct (expensive) user studies.
The videos, collected from YouTube, come with the Creative Commons CC-BY (v3.0) license. We release both the video files and their URLs. The shot-level importance scores were annotated via Amazon Mechanical Turk; each video was annotated by 20 crowd-workers. The dataset has been reviewed to conform to Yahoo's data protection standards, including strict controls on privacy.
Task
The primary task of the dataset is video summarization, where the goal is to create a short, meaningful summary of a given video. The summary should contain a few non-redundant shots that capture the highlights of the video. Although the task is inherently subjective, we carefully curated and annotated the dataset so that evaluation can be done objectively. (We observed a reasonably high degree of inter-rater reliability, with a Cronbach's alpha of 0.81.)
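For reference, Cronbach's alpha over k raters is the standard reliability statistic below; this is the textbook definition, not anything specific to TVSum:

```latex
% k = 20 annotators; \sigma^2_{Y_i} is the variance of annotator i's scores,
% and \sigma^2_X is the variance of the annotators' summed scores.
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_X}\right)
```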
Evaluations
Let's say we've generated a 15-second-long summary of the video "Will a cat eat dog food?".
How do we assess the quality of the generated summary? We can use the shot-level importance scores included in the dataset.
Each video is annotated as follows: we chopped the video into a set of two-second-long shots and asked 20 users to rate how important each shot is, relative to the other shots from the same video, on a scale from 1 (not important) to 5 (very important). The average of the responses becomes the shot-level importance score of each shot.
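To make the aggregation concrete, here is a minimal sketch; the `ratings` matrix below is randomly generated stand-in data, not the actual annotations:

```python
import numpy as np

# Hypothetical annotation matrix: 20 annotators x 30 two-second shots,
# with ratings in 1..5 (stand-in data for illustration only).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(20, 30))

# The shot-level importance score is the mean rating across annotators.
importance = ratings.mean(axis=0)  # shape (30,)
print(importance.round(2))
```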
From the shot-level importance scores, we can formulate video summarization as a 0/1 knapsack problem: given a time budget (e.g., 15 seconds), maximize the total importance score of the shots included in the summary. The solution is a vector of size T (the number of frames in the video) whose values are either 0 (not in the summary) or 1 (in the summary). Let's call this solution the gold standard.
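Below is a sketch of this formulation, assuming uniform two-second shots, 30 fps, and a hypothetical `importance` array; the dataset itself does not prescribe a particular knapsack solver:

```python
import numpy as np

def knapsack_select(values, lengths, budget):
    """0/1 knapsack: pick shots that maximize total importance subject to
    the summed shot lengths (integer seconds) staying within `budget`."""
    n = len(values)
    dp = np.zeros((n + 1, budget + 1))  # dp[i][b]: best value, first i shots, budget b
    for i in range(1, n + 1):
        w, v = lengths[i - 1], values[i - 1]
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if w <= b and dp[i - 1][b - w] + v > dp[i][b]:
                dp[i][b] = dp[i - 1][b - w] + v
    # Backtrack to recover which shots were selected.
    selected, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            selected.append(i - 1)
            b -= lengths[i - 1]
    return sorted(selected)

# Hypothetical example: 30 two-second shots, 15-second budget.
importance = np.random.rand(30)
picked = knapsack_select(importance, [2] * 30, budget=15)

# Expand the shot selection into a frame-level 0/1 vector of size T,
# assuming 30 fps, i.e., 60 frames per two-second shot.
fps, shot_len = 30, 2
T = 30 * shot_len * fps
gold_standard = np.zeros(T, dtype=int)
for s in picked:
    gold_standard[s * shot_len * fps : (s + 1) * shot_len * fps] = 1
```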
If our summary is similar to the gold standard, it means our summary is close to how the 20 annotators would have summarized the video. We can represent our summary as another vector of size T, whose elements are 1 for frames included in the summary and 0 otherwise. We can then measure how good the produced summary is by comparing the two vectors of the same size, using various metrics (e.g., the F1 score).
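For example, a frame-level F1 score could be computed as follows (a minimal sketch; for binary vectors this matches sklearn.metrics.f1_score):

```python
import numpy as np

def f1(pred, gold):
    """F1 between two binary frame-level vectors of equal size T."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    overlap = float(np.sum(pred * gold))  # frames present in both summaries
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()  # fraction of predicted frames that are correct
    recall = overlap / gold.sum()     # fraction of gold-standard frames recovered
    return 2 * precision * recall / (precision + recall)
```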
How to get the data
Update on Nov 10, 2019: The dataset is no longer available on Yahoo! Webscope. Please download the dataset directly from http://people.csail.mit.edu/yalesong/tvsum.
The dataset was previously distributed through the Yahoo! Webscope program, using the steps below:
1. Visit: I4 - Title-based Video Summarization dataset, version 1.1 (644M).
2. Sign in to your Yahoo account.
3. Click "Select this Dataset" - "View Cart".
4. Fill out the "Research Purpose" section.
5. Click "Continue".
Note for the participants of the 2016 LDV Vision Summit - Entrepreneurial Computer Vision Challenges (ECVC): We've arranged a special approval process for ECVC participants, so you do not need an academic email address to be approved for download. Please mention in the form (the "Research Purpose" section) that you are a participant of the challenge.
Reference
Song, Yale, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. "TVSum: Summarizing web videos using titles." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5179-5187. 2015.