  • Stars: 654
  • Rank 68,870 (Top 2 %)
  • Created over 5 years ago
  • Updated 9 months ago

Repository Details

A curated list of different papers and datasets in various areas of audio-visual processing

Awesome Audio-Visual

A curated list of papers and datasets for various audio-visual tasks, inspired by awesome-computer-vision.

Contents

Audio-Visual Localization

Audio-Visual Separation

Audio-Visual Representation/Classification/Retrieval

Audio-Visual Action Recognition

Audio-Visual Spatial/Depth

Audio-Visual Highlight Detection

Audio-Visual Deepfake

Audio-Visual Navigation/RL

Audio-Visual Faces/Speech

Audio-Visual Learning of Scene Acoustics

Audio-Visual Question Answering

Cross-modal Generation (Audio-Video / Video-Audio)

Audio-Visual Stylization/Generation

Multi-modal Architectures

Uncategorized Papers

Datasets

General Audio-Visual Tasks

  • AudioSet - Audio-Visual Classification
  • MUSIC - Audio-Visual Source Separation
  • AudioSetZSL - Audio-Visual Zero-shot Learning
  • Visually Engaged and Grounded AudioSet (VEGAS) - Sound generation from video
  • SoundNet-Flickr - Image-Audio pair for cross-modal learning
  • Audio-Visual Event (AVE) - Audio-Visual Event Localization
  • AudioSet Single Source - Subset of AudioSet videos containing only a single sounding object
  • Kinetics-Sounds - Subset of Kinetics dataset
  • EPIC-Kitchens - Egocentric Audio-Visual Action Recognition
  • Audio-Visually Indicated Actions Dataset - Multimodal dataset (RGB, acoustic data as raw audio) acquired using an acoustic-optical camera
  • IMSDb dataset - Movie scripts downloaded from the Internet Movie Script Database
  • YOUTUBE-ASMR-300K dataset - ASMR videos collected from YouTube that contain stereo audio
  • FAIR-Play - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
  • VGG-Sound - Audio-visual correspondence dataset consisting of short audio clips extracted from videos uploaded to YouTube
  • XD-Violence - weakly annotated dataset for audio-visual violence detection
  • AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) - Geotagged aerial images and sounds, classified into 13 scene classes
  • auDIoviSual Crowd cOunting dataset (DISCO) - 1,935 images and audio clips from various typical scenes, with a total of 170,270 instances annotated with head locations.
  • MUSIC-Synthetic dataset - Category-balanced multi-source videos, artificially synthesized from solo videos in the MUSIC dataset, to facilitate the learning and evaluation of multiple-sounding-source localization in the cocktail-party scenario.
  • ACAV100M - Built by processing 140 million full-length videos (total duration 1,030 years) to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence.
  • AIST++ - A large-scale 3D human dance motion dataset, which contains a wide variety of 3D motion paired with music. It is built upon the AIST Dance Database, which is an uncalibrated multi-view collection of dance videos.
  • VideoCC - A dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
  • ssw60 - A dataset for research on audiovisual fine-grained categorization. The dataset covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets and brand-new, expert-curated audio and video data.
  • PACS - A dataset designed to help create and evaluate a new generation of AI algorithms able to reason about physical commonsense using both audio and visual modalities.
  • AVSBench - A dataset for the audio-visual pixel-wise segmentation task.
  • UnAV-100 - The dataset consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 different event categories. There are often multiple audio-visual events that might be very short or long, and occur concurrently in each video as in real-life audio-visual scenes.

Face-Voice Dataset

Licenses

License

CC0

To the extent possible under law, Kranti Kumar Parida has waived all copyright and related or neighboring rights to this work.

Contributing

Please feel free to send me pull requests or email ([email protected]) to add links, correct wrong ones, or report broken links.