
Awesome Multimodality 🎶📜


A collection of resources on multimodal learning research.

Content

1. Description

🐌 Markdown Format:

  • (Conference/Journal Year) Title, First Author et al. [Paper] [Code] [Project]
  • (Conference/Journal Year) [💬Topic] Title, First Author et al. [Paper] [Code] [Project]
    • (Optional) 🌱 or 📌
    • (Optional) 🚀 or 👑 or 📚
  • 🌱: Novel idea
  • 📌: The first...
  • 🚀: State-of-the-Art
  • 👑: Novel dataset/model
  • 📚: Downstream Tasks
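
For illustration, a new entry following this convention would be written as below; the venue, topic, title, author, and links are placeholders, not a real paper:

```markdown
- (CVPR 2024) [💬Image Captioning] An Example Paper Title, Jane Doe et al. [Paper] [Code]
  - 🌱 Novel idea
  - 📚 Downstream Tasks: Image Captioning, Visual Question Answering (VQA)
```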

2. Topic Order

  • Survey

    • (TPAMI 2023) [💬Transformer] Multimodal Learning with Transformers: A Survey, Peng Xu et al. [v1](2022.06.13) [v2](2023.05.11)
    • (Multimedia Tools and Applications) A comprehensive survey on generative adversarial networks used for synthesizing multimedia content, Lalit Kumar & Dushyant Kumar Singh [v1](2023.03.30)
    • ⭐⭐(arXiv preprint 2023) Multimodal Deep Learning, Cem Akkus et al. [v1](2023.01.12)
    • ⭐(arXiv preprint 2022) [💬Knowledge Enhanced] A survey on knowledge-enhanced multimodal learning, Maria Lymperaiou et al. [v1](2022.11.19)
    • ⭐⭐(arXiv preprint 2022) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, Zhe Gan et al. [v1](2022.10.17)
    • ⭐(arXiv preprint 2022) Vision+X: A Survey on Multimodal Learning in the Light of Data, Ye Zhu et al. [v1](2022.10.05)
    • (arXiv preprint 2022) Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions, Paul Pu Liang et al. [v1](2022.09.07)
    • (arXiv preprint 2022) [💬Cardiac Image Computing] Multi-Modality Cardiac Image Computing: A Survey, Lei Li et al. [v1](2022.08.26)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] Vision-and-Language Pretraining, Thong Nguyen et al. [v1](2022.07.05)
    • (arXiv preprint 2022) [💬Video Saliency Detection] A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!, Chenglizhao Chen et al. [v1](2022.06.20)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] Vision-and-Language Pretrained Models: A Survey, Siqu Long et al. [v1](2022.04.15)...[v5](2022.05.03)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] VLP: A Survey on Vision-Language Pre-training, Feilong Chen et al. [v1](2022.02.18) [v2](2022.02.21)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] A Survey of Vision-Language Pre-Trained Models, Yifan Du et al. [v1](2022.02.18)
    • (arXiv preprint 2022) [💬Multi-Modal Knowledge Graph] Multi-Modal Knowledge Graph Construction and Application: A Survey, Xiangru Zhu et al. [v1](2022.02.11)
    • (arXiv preprint 2022) [💬Auto Driving] Multi-modal Sensor Fusion for Auto Driving Perception: A Survey, Keli Huang et al. [v1](2022.02.06) [v2](2022.02.27)
    • (arXiv preprint 2021) A Survey on Multi-modal Summarization, Anubhav Jangra et al. [v1](2021.09.11)
    • (Information Fusion 2021) [💬Vision and language] Multimodal research in vision and language: A review of current and emerging trends, Shagun Uppal et al. [v1](2021.08.01)
  • 👑 Dataset

    • (arXiv preprint 2023) Sticker820K: Empowering Interactive Retrieval with Stickers, Sijie Zhao et al. [Paper] [Github]
    • (arXiv preprint 2023) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration, Chenyang Lyu et al. [Paper] [Github]
    • (arXiv preprint 2022) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, Jiaxi Gu et al. [Paper] [Download]
      • The Noah-Wukong dataset is a large-scale multi-modality Chinese dataset.
      • The dataset contains 100 million <image, text> pairs.
      • Images in the dataset are filtered by size (> 200 px in both dimensions) and aspect ratio (between 1/3 and 3).
      • Text in the dataset is filtered by language, length, and frequency; privacy-related and sensitive words are also taken into consideration.
    • (arXiv preprint 2022) WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models, Sha Yuan et al. [Paper] [Download]
  • 💬 Vision and language Pre-training (VLP)

    • (arXiv preprint 2023) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video, Haiyang Xu et al. [Paper] [Code]

      • 📚 Downstream Tasks:
        • [Vision Only] Video Action Recognition, Image Classification, Object Detection and Segmentation
        • [Language Only] Natural Language Understanding, Natural Language Generation
        • [Video-Text] Text-to-Video Retrieval, Video Question Answering, Video Captioning
        • [Image-Text] Image-Text Retrieval, Visual Question Answering, Image Captioning, Visual Grounding
    • (EMNLP 2022) FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, Suvir Mirchandani et al. [Paper]

      • 📚 Downstream Tasks: Image-to-Text Retrieval & Text-to-Image Retrieval, Image Retrieval with Text Feedback, Category Recognition & Subcategory Recognition, Image Captioning, Relative Image Captioning
    • (arXiv preprint 2022) PaLI: A Jointly-Scaled Multilingual Language-Image Model, Xi Chen et al. [Paper]

      • 📚 Downstream Tasks: Image Captioning, Visual Question Answering (VQA), Language-understanding Capabilities, Zero-shot Image Classification
    • ⭐⭐(arXiv preprint 2022) Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, Wenhui Wang et al. [Paper] [Code]


      • 📚 [Visual-Language] Visual Question Answering (VQA), Visual Reasoning, Image Captioning, Image-Text Retrieval
      • 📚 [Visual] Object Detection, Instance Segmentation, Semantic Segmentation, Image Classification
    • (ECCV 2022) Exploiting Unlabeled Data with Vision and Language Models for Object Detection, Shiyu Zhao et al. [Paper] [Code]

      • 📚 Downstream Tasks: Open-vocabulary object detection, Semi-supervised object detection, Pseudo label generation
    • ⭐⭐[CVPR 2022 Tutorial] Recent Advances in Vision-and-Language Pre-training [Project]

    • ⭐⭐(arXiv preprint 2022) [💬Data Augmentation] MixGen: A New Multi-Modal Data Augmentation, Xiaoshuai Hao et al. [Paper]

      • 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Visual Grounding, Visual Reasoning, Visual Entailment
    • ⭐⭐(ICML 2022) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, Yan Zeng et al. [Paper] [Code]

      • 🚀 SOTA (2022/06/16): Cross-Modal Retrieval on COCO 2014 & Flickr30k, Visual Grounding on RefCOCO+ val & RefCOCO+ testA, RefCOCO+ testB
      • 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Grounding, Image Captioning
    • ⭐⭐(arXiv preprint 2022) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, Basil Mustafa et al. [Paper] [Blog]

      • 📌 LIMoE: The first large-scale multimodal mixture-of-experts model.
    • (CVPR 2022) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, Mingyang Zhou et al. [Paper] [Code]

      • 📚 Downstream Tasks: Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Entailment, Referring Expression (RefCOCO+)
    • ⭐(arXiv preprint 2022) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, Yong Dai et al. [Paper]

      • 📚 Downstream Tasks: Text Classification, Automatic Speech Recognition, Text-to-Image Retrieval, Text-to-Video Retrieval, Text-to-Code Retrieval
    • (arXiv preprint 2022) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, Chunyu Xie et al. [Paper]

      • 📚 Downstream Tasks: Image-text Retrieval, Chinese Image-text Matching
    • (arXiv preprint 2022) Vision-Language Pre-Training with Triple Contrastive Learning, Jinyu Yang et al. [Paper] [Code]

      • 📚 Downstream Tasks: Image-text Retrieval, Visual Question Answering, Visual Entailment, Visual Reasoning
    • (arXiv preprint 2022) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, Zejun Li et al. [Paper]

      • 📚 Downstream Tasks: Image-text Retrieval, Multi-Modal Classification, Visual Grounding
    • (arXiv preprint 2022) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, Junnan Li et al. [Paper] [Code]

      • 📚 Downstream Tasks: Image-text Retrieval, Image Captioning, Visual Question Answering, Visual Reasoning, Visual Dialog
    • (ICML 2021) ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Wonjae Kim et al. [Paper]

      • 📚 Downstream Tasks: Image Text Matching, Masked Language Modeling

3. Chronological Order

  • 2023

    • (arXiv preprint 2023) Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation, Zhiwei Zhang et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, Gen Luo et al. [Paper] [Project] [Code]
    • ⭐⭐(arXiv preprint 2023) Any-to-Any Generation via Composable Diffusion, Zineng Tang et al. [Paper] [Project] [Code]
      • 📚 [Single-to-Single Generation] Text → Image, Audio → Image, Image → Video, Image → Audio, Audio → Text, Image → Text
      • 📚 [Multi-Outputs Joint Generation] Text → Video + Audio, Text → Text + Audio + Image, Text + Image → Text + Image
      • 📚 [Multiple Conditioning] Text + Audio → Image, Text + Image → Image, Text + Audio + Image → Image, Text + Audio → Video, Text + Image → Video, Video + Audio → Text, Image + Audio → Audio, Text + Image → Audio
    • ⭐⭐(arXiv preprint 2023) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, Qinghao Ye et al. [Paper] [Demo] [Code]
    • (arXiv preprint 2023) Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models, Zhiqiu Lin et al. [Paper] [Project] [Code]
  • 2022

    • (arXiv preprint 2022) [💬Visual Metaphors] MetaCLUE: Towards Comprehensive Visual Metaphors Research, Arjun R. Akula et al. [Paper] [Project]
    • (arXiv preprint 2022) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks, Letitia Parcalabescu et al. [Paper] [Code]
    • (arXiv preprint 2022) Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, Xingqian Xu et al. [Paper] [Code] [Hugging Face]
      • 📚 Downstream Tasks: Text-to-Image, Image-Variation, Image-to-Text, Disentanglement, Text+Image-Guided Generation, Editable I2T2I
    • (Machine Intelligence Research) [💬Vision-language transformer] Masked Vision-Language Transformer in Fashion, Ge-Peng Ji et al. [Paper] [Code]
    • (arXiv 2022) [💬Multimodal Modeling] MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning, Zijia Zhao et al. [Paper]
    • (arXiv 2022) [💬Navigation] Iterative Vision-and-Language Navigation, Jacob Krantz et al. [Paper]
    • (arXiv 2022) [💬Video Chapter Generation] Multi-modal Video Chapter Generation, Xiao Cao et al. [Paper]
    • (arXiv 2022) [💬Visual Question Answering (VQA)] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, Jun Wang et al. [Paper] [Code]
    • (AI Ethics and Society 2022) [💬Multi-modal & Bias] American == White in Multimodal Language-and-Image AI, Robert Wolfe et al. [Paper]
    • (Interspeech 2022) [💬Audio-Visual Speech Separation] Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation, Xiaoyu Wang et al. [Paper]
    • (arXiv preprint 2022) [💬Multi-modal for Recommendation] Personalized Showcases: Generating Multi-Modal Explanations for Recommendations, An Yan et al. [Paper]
    • (CVPR 2022) [💬Video Synthesis] Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Ligong Han et al. [Paper] [Code] [Project]
    • (NAACL 2022) [💬Dialogue State Tracking] Multimodal Dialogue State Tracking, Hung Le et al. [Paper]
    • (arXiv preprint 2022) [💬Multi-modal Multi-task] MultiMAE: Multi-modal Multi-task Masked Autoencoders, Roman Bachmann et al. [Paper] [Code] [Project]
    • (CVPR 2022) [💬Text-Video Retrieval] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, Satya Krishna Gorti et al. [Paper] [Code] [Project]
    • (NAACL 2022) [💬Visual Commonsense] Visual Commonsense in Pretrained Unimodal and Multimodal Models, Chenyu Zhang et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Pretraining framework] i-Code: An Integrative and Composable Multimodal Learning Framework, Ziyi Yang et al. [Paper]
    • (CVPR 2022) [💬Food Retrieval] Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval, Mustafa Shukor et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Image+Videos+3D Data Recognition] Omnivore: A Single Model for Many Visual Modalities, Rohit Girdhar et al. [Paper] [Code] [Project]
    • (arXiv preprint 2022) [💬Hyper-text Language-image Model] CM3: A Causal Masked Multimodal Model of the Internet, Armen Aghajanyan et al. [Paper]
  • 2021

    • (arXiv preprint 2021) [💬Visual Synthesis] NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, Chenfei Wu et al. [Paper] [Code]
      • Figure from paper (source: https://github.com/microsoft/NUWA [2021/11/30])

    • (ICCV 2021) [💬Video-Text Alignment] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment, Jianwei Yang et al. [Paper]
    • (arXiv preprint 2021) [💬Class-agnostic Object Detection] Multi-modal Transformers Excel at Class-agnostic Object Detection, Muhammad Maaz et al. [Paper] [Code]
    • (ACMMM 2021) [💬Video-Text Retrieval] HANet: Hierarchical Alignment Networks for Video-Text Retrieval, Peng Wu et al. [Paper] [Code]
    • (ICCV 2021) [💬Video Recognition] AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition, Rameswar Panda et al. [Paper] [Project] [Code]
    • (ICCV 2021) [💬Video Representation] CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, Mohammadreza Zolfaghari et al. [Paper]
    • (ICCV 2021 Oral) [💬Text-guided Image Manipulation] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, Or Patashnik et al. [Paper] [Code] [Play]
    • (ICCV 2021) [💬Facial Editing] Talk-to-Edit: Fine-Grained Facial Editing via Dialog, Yuming Jiang et al. [Paper] [Code] [Project] [Dataset Project] [Dataset(CelebA-Dialog Dataset)]
    • (arXiv preprint 2021) [💬Video Action Recognition] ActionCLIP: A New Paradigm for Video Action Recognition, Mengmeng Wang et al. [Paper]
  • 2020

    • (EMNLP 2020) [💬Video+Language Pre-training] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, Linjie Li et al. [Paper] [Code]

4. Courses

Contact Me