Awesome Audio-Visual
A curated list of papers and datasets for various audio-visual tasks, inspired by awesome-computer-vision.
Contents
- Audio-Visual Localization
- Audio-Visual Separation
- Audio-Visual Representation/Classification/Retrieval
- Audio-Visual Action Recognition
- Audio-Visual Spatial/Depth
- Audio-Visual Highlight Detection
- Audio-Visual Deepfake
- Audio-Visual Navigation/RL
- Audio-Visual Faces/Speech
- Audio-Visual Learning of Scene Acoustics
- Audio-Visual Question Answering
- Cross-modal Generation (Audio-Video / Video-Audio)
- Audio-Visual Stylization/Generation
- Multi-modal Architectures
- Uncategorized Papers
- Datasets
Audio-Visual Localization
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline - Geng, T., Wang, T., Duan, J., Cong, R., & Zheng, F. (CVPR 2023)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning - Sun, W., Zhang, J., Wang, J., Liu, Z., Zhong, Y., Feng, T., ... & Barnes, N. (CVPR 2023) [code]
- Dual Perspective Network for Audio Visual Event Localization - Rao, V., Khalil, M. I., Li, H., Dai, P., & Lu, J. (ECCV 2022)
- A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos - Xuan, H., Wu, Z., Yang, J., Yan, Y., & Alameda-Pineda, X. (CVPR 2022)
- Mix and Localize: Localizing Sound Sources in Mixtures - Hu, X., Chen, Z., & Owens, A. (CVPR 2022) [project page] [code]
- Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks - Pan, W., Shi, H., Zhao, Z., Zhu, J., He, X., Pan, Z., ... & Tian, Q. (CVPR 2022) [code]
- Cross-Modal Background Suppression for Audio-Visual Event Localization - Xia, Y., & Zhao, Z. (CVPR 2022) [code]
- Audio-Visual Grouping Network for Sound Localization from Mixtures - Mo S., Tian Y. (CVPR 2023) [code]
- Egocentric Audio-Visual Object Localization - C. Huang, Y. Tian, A. Kumar, C. Xu (CVPR 2023) [code]
- A Closer Look at Weakly-Supervised Audio-Visual Source Localization - Mo, S., & Morgado, P. (NeurIPS 2022) [code]
- Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing - Mo S., Tian Y. (NeurIPS 2022) [code]
- Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing - Lin, Y. B., Tseng, H. Y., Lee, H. Y., Lin, Y. Y., & Yang, M. H. (NeurIPS 2021)
- Localizing Visual Sounds the Hard Way - Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (CVPR 2021) [code] [project page]
- Positive Sample Propagation along the Audio-Visual Event Line - Zhou, J., Zheng, L., Zhong, Y., Hao, S., & Wang, M. (CVPR 2021) [code]
- Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing - Wu Y., Yang Y. (CVPR 2021) [code]
- Audio-Visual Localization by Synthetic Acoustic Image Generation - Sanguineti, V., Morerio, P., Del Bue, A., Murino, V. (AAAI 2021)
- Binaural Audio-Visual Localization - Wu, X., Wu, Z., Ju L., Wang S. (AAAI 2021) [dataset]
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching - Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., Dou, D. (NeurIPS 2020) [code] [dataset] [demo]
- Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision - Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., & Yang, Z. (ECCV 2020) [project page/dataset]
- Do We Need Sound for Sound Source Localization? - Oya, T., Iwase, S., Natsume, R., Itazuri, T., Yamaguchi, S., & Morishima, S. (arXiv 2020)
- Multiple Sound Sources Localization from Coarse to Fine - Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., & Lin, W. (ECCV 2020) [code]
- Learning Differentiable Sparse and Low Rank Networks for Audio-Visual Object Localization - Pu, J., Panagakis, Y., & Pantic, M. (ICASSP 2020)
- What Makes the Sound?: A Dual-Modality Interacting Network for Audio-Visual Event Localization - Ramaswamy, J. (ICASSP 2020)
- Self-supervised learning for audio-visual speaker diarization - Ding, Y., Xu, Y., Zhang, S. X., Cong, Y., & Wang, L. (ICASSP 2020)
- See the Sound, Hear the Pixels - Ramaswamy, J., & Das, S. (WACV 2020)
- Dual Attention Matching for Audio-Visual Event Localization - Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (ICCV 2019)
- Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events - Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (arXiv 2018, CVPRW 2018)
- Learning to Localize Sound Source in Visual Scenes - Senocak, A., Oh, T. H., Kim, J., Yang, M. H., & Kweon, I. S. (CVPR 2018)
- Objects that Sound - Arandjelovic, R., & Zisserman, A. (ECCV 2018)
- Audio-Visual Event Localization in Unconstrained Videos - Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (ECCV 2018) [project page] [code]
- Audio-visual object localization and separation using low-rank and sparsity - Pu, J., Panagakis, Y., Petridis, S., & Pantic, M. (ICASSP 2017)
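Many of the self-supervised localization papers above share one core mechanism: score every spatial position of a visual feature map against a global audio embedding, so that high-similarity regions mark the likely sound source. A minimal PyTorch sketch of that idea; all shapes and names here are chosen purely for illustration, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def localization_map(visual_feats: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity audio-visual localization map.

    visual_feats: (B, C, H, W) spatial features from a vision backbone.
    audio_emb:    (B, C) global embedding from an audio backbone.
    Returns a (B, H, W) map; high values mark likely sounding regions.
    """
    v = F.normalize(visual_feats, dim=1)       # unit-norm channel vector per pixel
    a = F.normalize(audio_emb, dim=1)          # unit-norm audio embedding
    return torch.einsum("bchw,bc->bhw", v, a)  # per-pixel cosine similarity

# Toy usage with random tensors standing in for real backbone outputs.
v = torch.randn(2, 512, 14, 14)
a = torch.randn(2, 512)
print(localization_map(v, a).shape)  # torch.Size([2, 14, 14])
```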
Audio-Visual Separation
- iQuery: Instruments As Queries for Audio-Visual Sound Separation - Chen, J., Zhang, R., Lian, D., Yang, J., Zeng, Z., & Shi, J. (CVPR 2023) [code]
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency - Tan, R., Ray, A., Burns, A., Plummer, B. A., Salamon, J., Nieto, O., ... & Saenko, K. (CVPR 2023) [code]
- Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation - Cheng, H., Liu, Z., Wu, W., & Wang, L. (ICLR 2023)
- AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation - Tzinis, E., Wisdom, S., Remez, T., & Hershey, J. R. (ECCV 2022)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer - Montesinos, J. F., Kadandale, V. S., & Haro, G. (ECCV 2022) [project page] [code]
- Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation - Chatterjee, M., Ahuja, N., & Cherian, A. (NeurIPS 2022)
- Active Audio-Visual Separation of Dynamic Sound Sources - Majumder, S. & Grauman, K. (ECCV 2022) [project page] [code]
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation - Rahman, T., Yang, M., & Sigal, L. (NeurIPS 2021) [code]
- Move2Hear: Active Audio-Visual Source Separation - Majumder, S., Al-Halah, Z., & Grauman, K. (ICCV 2021) [code] [project page]
- Visual Scene Graphs for Audio Source Separation - Chatterjee, M., Le Roux, J., Ahuja, N., & Cherian, A. (ICCV 2021) [code] [project page]
- VisualVoice: Audio-Visual Speech Separation With Cross-Modal Consistency - Gao, R., & Grauman, K. (CVPR 2021) [code] [project page]
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation - Tian, Y., Hu, D., & Xu, C. (CVPR 2021) [code]
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation - Lee, J., Chung, S. W., Kim, S., Kang, H. G., & Sohn, K. (CVPR 2021) [project page]
- Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds - Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D.P. and Hershey, J.R. (ICLR 2021) [project page]
- Sep-stereo: Visually guided stereophonic audio generation by associating source separation - Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (ECCV 2020) [project page] [code]
- Visually Guided Sound Source Separation using Cascaded Opponent Filter Network - Zhu, L., & Rahtu, E. (arXiv 2020) [project page]
- Music Gesture for Visual Sound Separation - Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (CVPR 2020) [project page] [code]
- Recursive Visual Sound Separation Using Minus-Plus Net - Xudong Xu, Bo Dai, Dahua Lin (ICCV 2019)
- Co-Separating Sounds of Visual Objects - Gao, R. & Grauman, K. (ICCV 2019) [project page]
- The Sound of Motions - Zhao, H., Gan, C., Ma, W. & Torralba, A. (ICCV 2019)
- Learning to Separate Object Sounds by Watching Unlabeled Video - Gao, R., Feris, R., & Grauman, K. (ECCV 2018 (Oral)) [project page] [code] [dataset]
- The Sound of Pixels - Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (ECCV 2018) [project page] [code] [dataset]
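A recurring training recipe in this section, popularized by The Sound of Pixels, is "mix-and-separate": artificially mix two solo tracks, then train a network to recover each one, conditioned on the corresponding video. A toy sketch of that setup; the layer sizes, conditioning scheme, and binary-mask target below are illustrative simplifications, not any paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Predicts a spectrogram mask for one source from the mixture,
    conditioned on a visual embedding of that source's video."""
    def __init__(self, freq_bins: int = 256, vis_dim: int = 512):
        super().__init__()
        self.cond = nn.Linear(vis_dim, freq_bins)            # visual conditioning
        self.net = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, mix_spec: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        # vis_emb:  (B, vis_dim) embedding of the target source's video
        cond = self.cond(vis_emb)[:, None, :, None]          # (B, 1, F, 1)
        return torch.sigmoid(self.net(mix_spec) + cond)      # mask in [0, 1]

# Self-supervision: mix two solos, supervise the mask that recovers each one.
spec_a, spec_b = torch.rand(2, 1, 256, 64), torch.rand(2, 1, 256, 64)
mix = spec_a + spec_b
head = MaskHead()
mask_a = head(mix, torch.randn(2, 512))                      # visual emb of source A
loss = F.binary_cross_entropy(mask_a, (spec_a > spec_b).float())
```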
Audio-Visual Representation/Classification/Retrieval
- Vision Transformers Are Parameter-Efficient Audio-Visual Learners - Lin, Y. B., Sung, Y. L., Lei, J., Bansal, M., & Bertasius, G. (CVPR 2023) [code]
- Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception - Gao, J., Chen, M., & Xu, C. (CVPR 2023) [code]
- Contrastive Audio-Visual Masked Autoencoder - Gong, Y., Rouditchenko, A., Liu, A. H., Harwath, D., Karlinsky, L., Kuehne, H., & Glass, J. R. (ICLR 2023) [code]
- Audio-Visual Segmentation - Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., ... & Zhong, Y. (ECCV 2022) [code]
- Temporal and cross-modal attention for audio-visual zero-shot learning - Mercea, O. B., Hummel, T., Koepke, A. S., & Akata, Z. (ECCV 2022) [code]
- Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment - Lee, S., Park, S., & Ro, Y. M. (ECCV 2022)
- Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing - Cheng, H., Liu, Z., Zhou, H., Qian, C., Wu, W., & Wang, L. (ECCV 2022) [code]
- MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound - Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., ... & Choi, Y. (CVPR 2022) [project page] [code]
- Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory - Lee, S., Kim, H. I., & Ro, Y. M. (CVPR 2022)
- Sound and Visual Representation Learning With Multiple Pretraining Tasks - Vasudevan, A. B., Dai, D., & Van Gool, L. (CVPR 2022)
- Self-Supervised Object Detection From Audio-Visual Correspondence - Afouras, T., Asano, Y. M., Fagan, F., Vedaldi, A., & Metze, F. (CVPR 2022)
- Audio-Visual Generalised Zero-Shot Learning With Cross-Modal Attention and Language - Mercea, O. B., Riesch, L., Koepke, A. S., & Akata, Z. (CVPR 2022) [project page] [code]
- Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing - Mo, S., & Tian, Y. (NeurIPS 2022) [code]
- Learning State-Aware Visual Representations from Audible Interactions - Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta. (NeurIPS 2022) [code]
- ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning - Lee, S., Chung, J., Yu, Y., Kim, G., Breuel, T., Chechik, G., & Song, Y. (ICCV 2021) [code] [project page]
- Spoken moments: Learning joint audio-visual representations from video descriptions - Monfort, M., Jin, S., Liu, A., Harwath, D., Feris, R., Glass, J., & Oliva, A. (CVPR 2021) [project page/dataset]
- Robust Audio-Visual Instance Discrimination - Morgado, P., Misra, I., & Vasconcelos, N. (CVPR 2021)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning - Chen, Y., Xian, Y., Koepke, A., Shan, Y., & Akata, Z. (CVPR 2021) [code]
- Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning - Zhang, J., Xu, X., Shen, F., Lu, H., Liu, X., & Shen, H. T. (AAAI 2021)
- Active Contrastive Learning of Audio-Visual Video Representations - Ma, S., Zeng, Z., McDuff, D., & Song, Y. (ICLR 2021) [code]
- Labelling unlabelled videos from scratch with multi-modal self-supervision - Asano, Y., Patrick, M., Rupprecht, C., & Vedaldi, A. (NeurIPS 2020) [project page]
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning - Cheng, Y., Wang, R., Pan, Z., Feng, R., & Zhang, Y. (ACM MM 2020)
- Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition - Hu, D., Li, X., Mou, L., Jin, P., Chen, D., Jing, L., Zhu, X., & Dou, D. (ECCV 2020) [code]
- Leveraging Acoustic Images for Effective Self-Supervised Audio Representation Learning - Sanguineti, V., Morerio, P., Pozzetti, N., Greco, D., Cristani, M., & Murino, V. (ECCV 2020) [code]
- Self-Supervised Learning of Audio-Visual Objects from Video - Afouras, T., Owens, A., Chung, J. S., & Zisserman, A. (ECCV 2020) [project page]
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing - Tian, Y., Li, D., & Xu, C. (ECCV 2020)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement - Morgado, P., Vasconcelos, N., & Misra, I. (CVPR 2021)
- Vggsound: A Large-Scale Audio-Visual Dataset - Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. (ICASSP 2020) [project page/dataset] [code]
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data - Fayek, H. M., & Kumar, A. (IJCAI 2020)
- Multi-modal Self-Supervision from Generalized Data Transformations - Patrick, M., Asano, Y. M., Fong, R., Henriques, J. F., Zweig, G., & Vedaldi, A. (arXiv 2020)
- Curriculum Audiovisual Learning - Hu, D., Wang, Z., Xiong, H., Wang, D., Nie, F., & Dou, D. (arXiv 2020)
- Audio-visual model distillation using acoustic images - Perez, A., Sanguineti, V., Morerio, P., & Murino, V. (WACV 2020) [code] [dataset]
- Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos - Parida, K., Matiyali, N., Guha, T., & Sharma, G. (WACV 2020) [project page][Dataset]
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering - Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. (NeurIPS 2020)
- Look, listen, and learn more: Design choices for deep audio embeddings - Cramer, J., Wu, H. H., Salamon, J., & Bello, J. P. (ICASSP 2019) [code] [L3-embedding]
- Self-supervised audio-visual co-segmentation - Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (ICASSP 2019)
- Deep Multimodal Clustering for Unsupervised Audiovisual Learning - Hu, D., Nie, F., & Li, X. (CVPR 2019)
- Cooperative learning of audio and video models from self-supervised synchronization - Korbar, B., Tran, D., & Torresani, L. (NeurIPS 2018) [project page] [trained model 1] [trained model 2]
- Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description - Hori, C., Hori, T., Wichern, G., Wang, J., Lee, T. Y., Cherian, A., & Marks, T. K. (CVPRW 2018)
- Audio-Visual Scene Analysis with Self-Supervised Multisensory Features - Owens, A., & Efros, A. A. (ECCV 2018 (Oral)) [project page] [code]
- Look, listen and learn - Arandjelovic, R., & Zisserman, A. (ICCV 2017) [Keras-code]
- Ambient Sound Provides Supervision for Visual Learning - Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (ECCV 2016 (Oral)) [journal version] [project page]
- Soundnet: Learning sound representations from unlabeled video - Aytar, Y., Vondrick, C., & Torralba, A. (NIPS 2016) [project page] [code]
- See, hear, and read: Deep aligned representations - Aytar, Y., Vondrick, C., & Torralba, A. (arXiv 2017) [project page]
- Cross-Modal Embeddings for Video and Audio Retrieval - Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (ECCVW 2018)
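Much of the self-supervised work above (instance discrimination, correspondence, and clustering variants) trains an audio encoder and a video encoder with a contrastive objective: embeddings from the same clip are positives, and all other pairings in the batch are negatives. A minimal sketch of the symmetric InfoNCE loss commonly used for this; the temperature is a typical default, not a value taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/audio clip embeddings."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(v.size(0))  # matching pair sits on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
print(av_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)))
```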
Audio-Visual Action Recognition
- Audio-Adaptive Activity Recognition Across Video Domains - Zhang, Y., Doughty, H., Shao, L., & Snoek, C. G. (CVPR 2022) [project page] [code]
- Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization - Lee, J., Jain, M., Park, H., & Yun, S. (ICLR 2021)
- Speech2Action: Cross-modal Supervision for Action Recognition - Nagrani, A., Sun, C., Ross, D., Sukthankar, R., Schmid, C., & Zisserman, A. (CVPR 2020) [project page] [dataset]
- Listen to Look: Action Recognition by Previewing Audio - Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani (CVPR 2020) [project page]
- EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition - Kazakos, E., Nagrani, A., Zisserman, A., & Damen, D. (ICCV 2019) [project page] [code]
- Uncertainty-aware Audiovisual Activity Recognition using Deep Bayesian Variational Inference - Subedar, M., Krishnan, R., Meyer, P. L., Tickoo, O., & Huang, J. (ICCV 2019)
- Seeing and Hearing Egocentric Actions: How Much Can We Learn? - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (ICCVW 2019)
- How Much Does Audio Matter to Recognize Egocentric Object Interactions? - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (EPIC CVPRW 2019)
Audio-Visual Spatial/Depth
- Camera Pose Estimation and Localization with Active Audio Sensing - Yang, K., Firman, M., Brachmann, E., & Godard, C. (ECCV 2022)
- Few-Shot Audio-Visual Learning of Environment Acoustics - Majumder, S., Chen, C., Al-Halah, Z., & Grauman, K. (NeurIPS 2022) [code]
- Localize to Binauralize: Audio Spatialization From Visual Sound Source Localization - Rachavarapu, K. K., Sundaresha, V., & Rajagopalan, A. N. (ICCV 2021)
- Visually Informed Binaural Audio Generation without Binaural Audios - Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (CVPR 2021) [code]
- Beyond image to depth: Improving depth prediction using echoes - Parida, K. K., Srivastava, S., & Sharma, G. (CVPR 2021) [code] [project page]
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation - Lin, Y.-B., & Wang, Y.-C. F. (AAAI 2021)
- Learning Representations from Audio-Visual Spatial Alignment - Morgado, P., Li, Y., & Vasconcelos, N. (NeurIPS 2020) [code]
- VisualEchoes: Spatial Image Representation Learning through Echolocation - Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (ECCV 2020)
- BatVision with GCC-PHAT Features for Better Sound to Vision Predictions - Christensen, J. H., Hornauer, S., & Yu, S. (CVPRW 2020)
- BatVision: Learning to See 3D Spatial Layout with Two Ears - Christensen, J. H., Hornauer, S., & Yu, S. (ICRA 2020) [dataset/code]
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds - Vasudevan, A. B., Dai, D., & Van Gool, L. (arXiv 2020) [project page]
- Audio-Visual SfM towards 4D reconstruction under dynamic scenes - Konno, A., Nishida K., Itoyama K., Nakadai K. (CVPRW 2020)
- Telling Left From Right: Learning Spatial Correspondence of Sight and Sound - Yang, K., Russell, B., & Salamon, J. (CVPR 2020) [project page / dataset]
- 2.5D Visual Sound - Gao, R., & Grauman, K. (CVPR 2019) [project page] [dataset] [code]
- Self-supervised generation of spatial audio for 360 video - Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (NeurIPS 2018) [project page] [code/dataset]
- Self-supervised audio spatialization with correspondence classifier - Lu, Y. D., Lee, H. Y., Tseng, H. Y., & Yang, M. H. (ICIP 2019)
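Several of the spatialization papers above (2.5D Visual Sound and its follow-ups) cast mono-to-binaural generation as predicting the left-minus-right difference spectrogram from the mono mix plus visual context; since mono = L + R and diff = L - R, the channels are recovered as (mono ± diff) / 2. A toy sketch of that decomposition, with illustrative shapes and layers rather than any paper's architecture:

```python
import torch
import torch.nn as nn

class MonoToBinaural(nn.Module):
    """Predicts the L - R difference spectrogram from the mono mix,
    conditioned on a visual embedding of the scene."""
    def __init__(self, freq_bins: int = 256, vis_dim: int = 512):
        super().__init__()
        self.cond = nn.Linear(vis_dim, freq_bins)
        self.net = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, mono_spec: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # mono_spec: (B, 1, F, T); vis_emb: (B, vis_dim)
        cond = self.cond(vis_emb)[:, None, :, None]  # broadcast over time
        return self.net(mono_spec) + cond            # predicted L - R spectrogram

model = MonoToBinaural()
mono = torch.randn(2, 1, 256, 64)
diff = model(mono, torch.randn(2, 512))
left, right = (mono + diff) / 2, (mono - diff) / 2   # recovered channels
```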
Audio-Visual Highlight Detection
- Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion - Ye, Q., Shen, X., Gao, Y., Wang, Z., Bi, Q., Li, P., & Yang, G. (ICCV 2021)
- Joint Visual and Audio Learning for Video Highlight Detection - Badamdorj, T., Rochan, M., Wang, Y., & Cheng, L. (ICCV 2021)
Audio-Visual Deepfake
- Joint Audio-Visual Deepfake Detection - Zhou, Y., & Lim, S. N. (ICCV 2021)
Audio-Visual Navigation/RL
- Sound Adversarial Audio-Visual Navigation - Yu, Y., Huang, W., Sun, F., Chen, C., Wang, Y., & Liu, X. (ICLR 2022) [project page] [code]
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments - Paul, S., Roy-Chowdhury, A., & Cherian, A. (NeurIPS 2022)
- Semantic Audio-Visual Navigation - Chen, C., Al-Halah, Z., & Grauman, K. (CVPR 2021) [project page] [code]
- Learning to set waypoints for audio-visual navigation - Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (ICLR 2021) [project page] [code]
- See, hear, explore: Curiosity via audio-visual association - Dean, V., Tulsiani, S., & Gupta, A. (arXiv 2020) [project page] [code]
- Audio-Visual Embodied Navigation - Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson P., Grauman, K. (ECCV 2020) [project page]
- Look, listen, and act: Towards audio-visual embodied navigation - Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (ICRA 2020) [project page/dataset]
Audio-Visual Faces/Speech
- DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation - Shen, S., Zhao, W., Meng, Z., Li, W., Zhu, Z., Zhou, J., & Lu, J. (CVPR 2023) [code]
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation - Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., ... & Wang, F. (CVPR 2023) [project page] [code]
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment - Huang, R., Lai, P., Qin, Y., & Li, G. (CVPR 2023)
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation - Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., & Yu, L. (CVPR 2023) [code]
- Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring - Hong, J., Kim, M., Choi, J., & Ro, Y. M. (CVPR 2023) [code]
- AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction - Chatziagapi, A., & Samaras, D. (CVPR 2023)
- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis - Ye, Z., Jiang, Z., Ren, Y., Liu, J., He, J., & Zhao, Z. (ICLR 2023) [code]
- Jointly Learning Visual and Auditory Speech Representations from Raw Data - Haliassos, A., Ma, P., Mira, R., Petridis, S., & Pantic, M. (ICLR 2023) [code]
- Audio-Driven Stylized Gesture Generation with Flow-Based Model - Ye, S., Wen, Y. H., Sun, Y., He, Y., Zhang, Z., Wang, Y., ... & Liu, Y. J. (ECCV 2022) [code]
- Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation - Liu, X., Xu, Y., Wu, Q., Zhou, H., Wu, W., & Zhou, B. (ECCV 2022) [project page] [code]
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction - Shi, B., Hsu, W. N., Lakhotia, K., & Mohamed, A. (ICLR 2022) [code]
- PoseKernelLifter: Metric Lifting of 3D Human Pose Using Sound - Yang, Z., Fan, X., Isler, V., & Park, H. S. (CVPR 2022)
- Audio-Driven Neural Gesture Reenactment With Video Motion Graphs - Zhou, Y., Yang, J., Li, D., Saito, J., Aneja, D., & Kalogerakis, E. (CVPR 2022) [code]
- Expressive Talking Head Generation With Granular Audio-Visual Control - Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., ... & Wang, J. (CVPR 2022)
- Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization - Jiang, H., Murdock, C., & Ithapu, V. K. (CVPR 2022)
- Audio-Driven Co-Speech Gesture Video Generation - Liu, X., Wu, Q., Zhou, H., Du, Y., Wu, W., Lin, D., & Liu, Z. (NeurIPS 2022) [project page] [code]
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis - Yang, K., Marković, D., Krenn, S., Agrawal, V., & Richard, A. (CVPR 2022) [video]
- Audio2Gestures: Generating Diverse Gestures From Speech Audio With Conditional Variational Autoencoders - Li, J., Kang, D., Pei, W., Zhe, X., Zhang, Y., He, Z., & Bao, L. (ICCV 2021) [code] [project page]
- Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association - Wen, P., Xu, Q., Jiang, Y., Yang, Z., He, Y., & Huang, Q. (CVPR 2021) [code]
- Audio-Driven Emotional Video Portraits - Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C. C., Cao, X., & Xu, F. (CVPR 2021) [project page] [code]
- Pose-controllable talking face generation by implicitly modularized audio-visual representation - Zhou, H., Sun, Y., Wu, W., Loy, C. C., Wang, X., & Liu, Z. (CVPR 2021) [project page] [code]
- One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing - Wang, T. C., Mallya, A., & Liu, M. Y. (CVPR 2021) [project page]
- Unsupervised audiovisual synthesis via exemplar autoencoders - Deng, K., Bansal, A., & Ramanan, D. (ICLR 2021) [project page]
- Mead: A large-scale audio-visual dataset for emotional talking-face generation - Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao Y., Loy, C. C. (ECCV 2020) [project page/dataset]
- Discriminative Multi-modality Speech Recognition - Xu, B., Lu, C., Guo, Y., & Wang, J. (CVPR 2020)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis - Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (CVPR 2020) [project page/dataset] [code]
- DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads - Zhang, X., Wu, X., Zhai, X., Ben, X., & Tu, C. (CVPR 2020)
- Learning to Have an Ear for Face Super-Resolution - Meishvili, G., Jenni, S., & Favaro, P. (CVPR 2020) [project page] [code]
- ASR is all you need: Cross-modal distillation for lip reading - Afouras, T., Chung, J. S., & Zisserman, A. (ICASSP 2020)
- Visually guided self supervised learning of speech representations - Shukla, A., Vougioukas, K., Ma, P., Petridis, S., & Pantic, M. (ICASSP 2020)
- Disentangled Speech Embeddings using Cross-modal Self-supervision - Nagrani, A., Chung, J. S., Albanie, S., & Zisserman, A. (ICASSP 2020)
- Animating Face using Disentangled Audio Representations - Mittal, G., & Wang, B. (WACV 2020)
- Deep Audio-Visual Speech Recognition - Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (TPAMI 2019)
- Reconstructing faces from voices - Wen, Y., Singh, R., & Raj, B. (NeurIPS 2019) [project page]
- Learning Individual Styles of Conversational Gesture - Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (CVPR 2019) [project page] [dataset]
- Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss - Chen, L., Maddox, R. K., Duan, Z., & Xu, C. (CVPR 2019) [project page]
- Speech2Face: Learning the Face Behind a Voice - Oh, T. H., Dekel, T., Kim, C., Mosseri, I., Freeman, W. T., Rubinstein, M., & Matusik, W. (CVPR 2019) [project page]
- My lips are concealed: Audio-visual speech enhancement through obstructions - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2019) [project page]
- Talking Face Generation by Adversarially Disentangled Audio-Visual Representation - Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang (AAAI 2019) [project page] [code]
- Disjoint mapping network for cross-modal matching of voices and faces - Wen, Y., Ismail, M. A., Liu, W., Raj, B., & Singh, R. (ICLR 2019) [project page]
- X2Face: A network for controlling face generation using images, audio, and pose codes - Wiles, O., Sophia Koepke, A., & Zisserman, A. (ECCV 2018) [project page] [code]
- Learnable PINs: Cross-Modal Embeddings for Person Identity - Nagrani, A., Albanie, S., & Zisserman, A. (ECCV 2018) [project page]
- Seeing voices and hearing faces: Cross-modal biometric matching - Nagrani, A., Albanie, S., & Zisserman, A. (CVPR 2018) [project page] [code] (trained model only)
- Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation - Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T. and Rubinstein, M., (SIGGRAPH 2018) [project page]
- The Conversation: Deep Audio-Visual Speech Enhancement - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [project page]
- VoxCeleb2: Deep Speaker Recognition - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [dataset]
- You said that? - Son Chung, J., Jamaludin, A., & Zisserman, A. (BMVC 2017) [project page] [code] (trained model, evaluation code)
- VoxCeleb: a large-scale speaker identification dataset - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2017) [project page] [code] [dataset]
- Out of time: automated lip sync in the wild - J.S. Chung & A. Zisserman (ACCVW 2016)
Audio-Visual Learning of Scene Acoustics
- INRAS: Implicit Neural Representations of Audio Scenes - Su, K.*, Chen, M.*, Shlizerman, E. (NeurIPS 2022)
- Learning Neural Acoustic Fields - Luo, A., Du, Y., Tarr, M., Tenenbaum, J., Torralba, A., & Gan, C. (NeurIPS 2022) [code] [project page]
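Both papers in this section fit an implicit neural field to a scene's acoustics: a coordinate network maps source and listener positions to (a representation of) the impulse response between them. A schematic coordinate-MLP sketch of that interface; the dimensions and the plain-MLP design are illustrative, not either paper's architecture:

```python
import torch
import torch.nn as nn

class AcousticField(nn.Module):
    """Toy implicit acoustic field: (source xyz, listener xyz) -> a short
    latent code for the impulse response between the two positions."""
    def __init__(self, ir_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ir_dim),
        )

    def forward(self, src_xyz: torch.Tensor, lis_xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([src_xyz, lis_xyz], dim=-1))

field = AcousticField()
ir_codes = field(torch.rand(4, 3), torch.rand(4, 3))  # (4, 128) IR embeddings
```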
Audio-Visual Question Answering
- PACS: A Dataset for Physical Audiovisual CommonSense Reasoning - Yu, S., Wu, P., Liang, P. P., Salakhutdinov, R., & Morency, L. P. (ECCV 2022) [code]
- Learning To Answer Questions in Dynamic Audio-Visual Scenarios - Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J. R., & Hu, D. (CVPR 2022) [project page] [code]
Cross-modal Generation (Audio-Video / Video-Audio)
- Conditional Generation of Audio From Video via Foley Analogies - Du, Y., Chen, Z., Salamon, J., Russell, B., & Owens, A. (CVPR 2023) [project page]
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment - Sung-Bin, K., Senocak, A., Ha, H., Owens, A., & Oh, T. H. (CVPR 2023) [project page]
- How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos - Su, K., Liu, X., & Shlizerman, E. (NeurIPS 2021)
- AI Choreographer: Music Conditioned 3D Dance Generation with AIST++ - Li, R., Yang, S., Ross, D. A., & Kanazawa, A. (ICCV 2021) [code] [project page] [dataset]
- Sound2Sight: Generating Visual Dynamics from Sound and Context - Cherian, A., Chatterjee, M., & Ahuja, N. (ECCV 2020)
- Generating Visually Aligned Sound from Videos - Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (IEEE Transactions on Image Processing 2020)
- Audeo: Audio Generation for a Silent Performance Video - Su, K., Liu, X., & Shlizerman, E. (NeurIPS 2020)
- Foley Music: Learning to Generate Music from Videos - Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (ECCV 2020) [project page]
- Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation - Tan, H., Wu, G., Zhao, P., & Chen, Y. (ICASSP 2020)
- Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck - Ma, S., McDuff, D., & Song, Y. (ICCV 2019) [code]
- Listen to the Image - Hu, D., Wang, D., Li, X., Nie, F., & Wang, Q. (CVPR 2019)
- Cascade attention guided residue learning GAN for cross-modal translation - Duan, B., Wang, W., Tang, H., Latapie, H., & Yan, Y. (arXiv 2019) [code]
- Visual to Sound: Generating Natural Sound for Videos in the Wild - Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (CVPR 2018) [project page]
- Image generation associated with music data - Qiu, Y., & Kataoka, H. (CVPRW 2018)
- CMCGAN: A uniform framework for cross-modal visual-audio mutual generation - Hao, W., Zhang, Z., & Guan, H. (AAAI 2018)
Audio-Visual Stylization/Generation
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation - Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., ... & Guo, B. (CVPR 2023) [code]
- MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration - Hayes, T., Zhang, S., Yin, X., Pang, G., Sheng, S., Yang, H., ... & Parikh, D. (ECCV 2022) [project page] [code]
- Learning visual styles from audio-visual associations - Li, T., Liu, Y., Owens, A., & Zhao, H. (ECCV 2022) [project page] [code]
- Sound-Guided Semantic Image Manipulation - Lee, S. H., Roh, W., Byeon, W., Yoon, S. H., Kim, C., Kim, J., & Kim, S. (CVPR 2022) [project page] [code]
Multi-modal Architectures
- What Makes Training Multi-Modal Networks Hard? - Wang, W., Tran, D., & Feiszli, M. (arXiv 2019)
- MFAS: Multimodal Fusion Architecture Search - Pérez-Rúa, J. M., Vielzeuf, V., Pateux, S., Baccouche, M., & Jurie, F. (CVPR 2019)
Uncategorized Papers
- CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective - Xiong, J., Wang, G., Zhang, P., Huang, W., Zha, Y., & Zhai, G. (CVPR 2023)
- Self-Supervised Video Forensics by Audio-Visual Anomaly Detection - Feng, C., Chen, Z., & Owens, A. (CVPR 2023) [code]
- Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset - Van Horn, G., Qian, R., Wilber, K., Adam, H., Mac Aodha, O., & Belongie, S. (ECCV 2022) [code]
- Learning Audio-Video Modalities from Image Captions - Nagrani, A., Seo, P. H., Seybold, B., Hauth, A., Manen, S., Sun, C., & Schmid, C. (ECCV 2022) [project page] [dataset]
- MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions - Soldan, M., Pardo, A., Alcázar, J. L., Caba, F., Zhao, C., Giancola, S., & Ghanem, B. (CVPR 2022) [code]
- Finding Fallen Objects via Asynchronous Audio-Visual Integration - Gan, C., Gu, Y., Zhou, S., Schwartz, J., Alter, S., Traer, J., ... & Torralba, A. (CVPR 2022) [code]
- Audio-Visual Floorplan Reconstruction - S. Purushwalkam, S. V. A. Gari, V. K. Ithapu, C. Schissler, P. Robinson, A. Gupta, K. Grauman (ICCV 2021) [code] [project page]
- GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition - (CVPR 2021)
- There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge - Valverde, F. R., Hurtado, J. V., & Valada, A. (CVPR 2021) [code] [project page/dataset]
- Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks? - Tian, Y., & Xu, C. (CVPR 2021) [code]
- Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection? - Chen, X., et al. (SLT 2022) [Demos]
- Sight to sound: An end-to-end approach for visual piano transcription - Koepke, A. S., Wiles, O., Moses, Y., & Zisserman, A. (ICASSP 2020) [project page/dataset]
- Solos: A Dataset for Audio-Visual Music Analysis - Montesinos, J. F., Slizovskaia, O., & Haro, G. (arXiv 2020) [project page] [dataset]
- Cross-Task Transfer for Multimodal Aerial Scene Recognition - Hu, D., Li, X., Mou, L., Jin, P., Chen, D., Jing, L., ... & Dou, D. (arXiv 2020) [code] [dataset]
- STAViS: Spatio-Temporal AudioVisual Saliency Network - Tsiami, A., Koutras, P., & Maragos, P. (CVPR 2020) [code]
- AlignNet: A Unifying Approach to Audio-Visual Alignment - Wang, J., Fang, Z., & Zhao, H. (WACV 2020) [project page] [code]
- Self-supervised Moving Vehicle Tracking with Stereo Sound - Gan, C., Zhao, H., Chen, P., Cox, D., & Torralba, A. (ICCV 2019) [project page/dataset]
- Vision-Infused Deep Audio Inpainting - Zhou, H., Liu, Z., Xu, X., Luo, P., & Wang, X. (ICCV 2019) [project page] [code]
- ISNN: Impact Sound Neural Network for Audio-Visual Object Classification - Sterling, A., Wilson, J., Lowe, S., & Lin, M. C. (ECCV 2018) [project page] [dataset1][dataset2] [model]
- Audio to Body Dynamics - Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (CVPR 2018) [project page][code]
- A Multimodal Approach to Mapping Soundscapes - Salem, T., Zhai, M., Workman, S., & Jacobs, N. (CVPRW 2018) [project page]
- Shape and material from sound - Zhang, Z., Li, Q., Huang, Z., Wu, J., Tenenbaum, J., & Freeman, B. (NeurIPS 2017)
Datasets
General Audio-Visual Tasks
- AudioSet - Audio-Visual Classification
- MUSIC - Audio-Visual Source Separation
- AudioSetZSL - Audio-Visual Zero-shot Learning
- Visually Engaged and Grounded AudioSet (VEGAS) - Sound generation from video
- SoundNet-Flickr - Image-Audio pair for cross-modal learning
- Audio-Visual Event (AVE) - Audio-Visual Event Localization
- AudioSet Single Source - Subset of AudioSet videos containing only a single sounding object
- Kinetics-Sounds - Subset of Kinetics dataset
- EPIC-Kitchens - Egocentric Audio-Visual Action Recognition
- Audio-Visually Indicated Actions Dataset - Multimodal dataset (RGB plus acoustic data as raw audio) acquired using an acoustic-optical camera
- IMSDb dataset - Movie scripts downloaded from The Internet Script Movie Database
- YOUTUBE-ASMR-300K dataset - ASMR videos collected from YouTube that contain stereo audio
- FAIR-Play - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
- VGG-Sound - audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
- XD-Violence - weakly annotated dataset for audio-visual violence detection
- AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) - Geotagged aerial images and sounds, classified into 13 scene classes
- auDIoviSual Crowd cOunting dataset (DISCO) - 1,935 images and corresponding audio clips from various typical scenes, with a total of 170,270 instances annotated with head locations.
- MUSIC-Synthetic dataset - Category-balanced multi-source videos built by artificially mixing solo videos from the MUSIC dataset, to facilitate learning and evaluation of multiple-sounding-source localization in the cocktail-party scenario.
- ACAV100M - 100 million 10-second clips (31 years of footage) with high audio-visual correspondence, automatically curated from 140 million full-length videos (total duration 1,030 years).
- AIST++ - A large-scale 3D human dance motion dataset containing a wide variety of 3D motion paired with music. It is built upon the AIST Dance Database, an uncalibrated multi-view collection of dance videos.
- VideoCC - A dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
- ssw60 - A dataset for research on audiovisual fine-grained categorization. It covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets plus brand-new, expert-curated audio and video data.
- PACS - A dataset designed to help create and evaluate a new generation of AI algorithms able to reason about physical commonsense using both audio and visual modalities.
- AVSBench - A dataset for audio-visual pixel-wise segmentation task.
- UnAV-100 - More than 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. As in real-life audio-visual scenes, each video often contains multiple concurrent events, which may be very short or long.
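Most of the general-purpose datasets above reduce to the same interface: a list of clips, each yielding a stack of frames plus the aligned audio waveform. A minimal PyTorch Dataset sketch using torchvision's video reader; the bare file-path list and the absence of labels and transforms are simplifications (real datasets such as AudioSet or VGG-Sound ship their own splits, labels, and preprocessing):

```python
from torch.utils.data import Dataset
from torchvision.io import read_video

class AVClipDataset(Dataset):
    """Yields (frames, waveform) pairs from a list of video files."""
    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # frames: (T, H, W, C) uint8; audio: (channels, samples)
        frames, audio, _info = read_video(self.paths[i], pts_unit="sec")
        return frames, audio
```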
Face-Voice Dataset
- VoxCeleb - Audio-Visual Speaker Identification, contains two versions
- EmoVoxCeleb
- Speech2Gesture - Gesture prediction from speech
- AVSpeech
- LRW Dataset
- LRS2, LRS3, LRS3 Language - Lip Reading Datasets
License
To the extent possible under law, Kranti Kumar Parida has waived all copyright and related or neighboring rights to this work.
Contributing
Please feel free to send me pull requests or email ([email protected]) to add links, correct existing ones, or report broken links.