Awesome Multi-Modal Reinforcement Learning

This is a collection of research papers on multi-modal reinforcement learning (MMRL). The repository will be continuously updated to track the frontier of MMRL. Some papers are not directly about RL, but we include them because they may be useful for MMRL research.

Feel free to follow and star the repository!

Introduction

Multi-Modal RL agents focus on learning from video (images), language (text), or both, as humans do. We believe that it is important for intelligent agents to learn directly from images or text, since such data can be easily obtained from the Internet.

A Taxonomy of Multi-Modal Reinforcement Learning
Papers
- ICLR 2023 (New!!!)
- ICLR 2022
- ICLR 2021
- ICLR 2019
- NeurIPS 2022
- NeurIPS 2021
- NeurIPS 2018
- ICML 2022
- ICML 2019
- ICML 2017
- CVPR 2022
- CoRL 2022
- Other
- ArXiv
Contributing

Papers

format:
- [title](paper link) [links]
  - authors.
  - key words.
  - experiment environment.
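
As a rough illustration of the template above, the following Python sketch renders one entry in that layout. The helper `format_entry`, its parameters, and the sample values are hypothetical and not part of this repository. It assumes the `[links]` part is optional and that a missing experiment environment is written as `None`, as in the entries below.

```python
def format_entry(title, paper_link, authors, key_words, exp_env=None, links=()):
    """Render one paper entry in the list's template (hypothetical helper)."""
    # First line: linked title, followed by optional extra links (code, project page).
    head = f"- [{title}]({paper_link})"
    if links:
        head += " " + " ".join(f"[{name}]({url})" for name, url in links)
    # Sub-items are indented by two spaces, in the order given by the template.
    body = [
        f"  - {', '.join(authors)}.",
        f"  - Key Words: {', '.join(key_words)}",
        f"  - ExpEnv: {exp_env if exp_env else 'None'}",
    ]
    return "\n".join([head, *body])


# Placeholder values only; replace them with the real paper details.
entry = format_entry(
    title="Paper Title",
    paper_link="paper link",
    authors=["First Author", "Second Author"],
    key_words=["first key word", "second key word"],
)
print(entry)
```

Pasting the printed text under the matching venue heading keeps a new entry consistent with the existing ones.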

ICLR 2023

PaLI: A Jointly-Scaled Multilingual Language-Image Model (notable top 5%)
- Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
- Key Words: zero-shot, language component and visual component
- ExpEnv: None
VIMA: General Robot Manipulation with Multimodal Prompts
- Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan. NeurIPS Workshop 2022
- Key Words: multimodal prompts, transformer-based generalist agent model, large-scale benchmark
- ExpEnv: VIMA-Bench, VIMA-Data
Mind's Eye: Grounded Language Model Reasoning through Simulation
- Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai
- Key Words: language2physical-world, reasoning ability
- ExpEnv: MuJoCo

ICLR 2022

How Much Can CLIP Benefit Vision-and-Language Tasks?
- Sheng Shen, Liunian Harold Li, Hao Tan, etc. ICLR 2022
- Key Words: Vision-and-Language, CLIP
- ExpEnv: None

ICLR 2021

Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning
- Austin W. Hanjie, Victor Zhong, Karthik Narasimhan. ICML 2021
- Key Words: Multi-modal Attention
- ExpEnv: Messenger
Mastering Atari with Discrete World Models
- Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, etc.
- Key Words: World models
- ExpEnv: Atari
Decoupling Representation Learning from Reinforcement Learning
- Adam Stooke, Kimin Lee, Pieter Abbeel, etc.
- Key Words: representation learning, unsupervised learning
- ExpEnv: DeepMind Control, Atari, DMLab

ICLR 2019

Learning Actionable Representations with Goal-Conditioned Policies
- Dibya Ghosh, Abhishek Gupta, Sergey Levine.
- Key Words: Actionable Representations Learning
- ExpEnv: 2D navigation (2D Wall, 2D Rooms, Wheeled, Wheeled Rooms, Ant, Pushing)

NeurIPS 2022

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
- Linxi Fan, Guanzhi Wang, Yunfan Jiang, etc.
- Key Words: multimodal dataset, MineCLIP
- ExpEnv: Minecraft
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
- Bowen Baker, Ilge Akkaya, Peter Zhokhov, etc.
- Key Words: Inverse Dynamics Model
- ExpEnv: MineRL

NeurIPS 2021

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
- Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, etc.
- Key Words: Vision-and-Language Navigation
- ExpEnv: Room-to-Room, Room-Across-Room
Pretraining Representations for Data-Efficient Reinforcement Learning
- Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, etc.
- Key Words: latent dynamics modelling, unsupervised RL
- ExpEnv: Atari

NeurIPS 2018

Recurrent World Models Facilitate Policy Evolution
- David Ha, Jürgen Schmidhuber.
- Key Words: World model, generative RNN, VAE
- ExpEnv: VizDoom, CarRacing

ICML 2022

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
- Wenlong Huang, Pieter Abbeel, Deepak Pathak, etc.
- Key Words: large language models, Embodied Agents
- ExpEnv: VirtualHome
Reinforcement Learning with Action-Free Pre-Training from Videos
- Younggyo Seo, Kimin Lee, Stephen L James, etc.
- Key Words: action-free pretraining, videos
- ExpEnv: Meta-world, DeepMind Control Suite
History Compression via Language Models in Reinforcement Learning
- Fabian Paischer, Thomas Adler, Vihang Patil, etc.
- Key Words: Pretrained Language Transformer
- ExpEnv: Minigrid, Procgen

ICML 2019

Learning Latent Dynamics for Planning from Pixels
- Danijar Hafner, Timothy Lillicrap, Ian Fischer, etc.
- Key Words: latent dynamics model, pixel observations
- ExpEnv: DeepMind Control Suite

ICML 2017

Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
- Junhyuk Oh, Satinder Singh, Honglak Lee, Pushmeet Kohli
- Key Words: unseen instruction, sequential instruction
- ExpEnv: Minecraft

CVPR 2022

End-to-end Generative Pretraining for Multimodal Video Captioning
- Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
- Key Words: Multimodal video captioning, Pretraining using a future utterance, Multimodal Video Generative Pretraining
- ExpEnv: HowTo100M
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
- Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei
- Key Words: backbone architecture, pretraining task, model scaling up
- ExpEnv: ADE20K, COCO, NLVR2, Flickr30K
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
- Key Words: dual-scale graph transformer, affordance detection
- ExpEnv: None
Masked Visual Pre-training for Motor Control
- Tete Xiao, Ilija Radosavovic, Trevor Darrell, etc. ArXiv 2022
- Key Words: self-supervised learning, motor control
- ExpEnv: Isaac Gym

CoRL 2022

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
- Dhruv Shah, Blazej Osinski, Brian Ichter, Sergey Levine
- Key Words: robotic navigation, goal-conditioned, unannotated large dataset, CLIP, ViNG, GPT-3
- ExpEnv: None
[Real-World Robot Learning with Masked Visual Pre-training](https://arxiv.org/abs/2210.03109)
- Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell
- Key Words: real-world robotic tasks
- ExpEnv: None
R3M: A Universal Visual Representation for Robot Manipulation
- Suraj Nair, Aravind Rajeswaran, Vikash Kumar, etc.
- Key Words: Ego4D human video dataset, pre-train visual representation
- ExpEnv: MetaWorld, Franka Kitchen, Adroit

Other

Language Conditioned Imitation Learning over Unstructured Data RSS 2021
- Corey Lynch, Pierre Sermanet
- Key Words: open-world environments
- ExpEnv: None
Learning Generalizable Robotic Reward Functions from “In-The-Wild” Human Videos RSS 2021
- Annie S. Chen, Suraj Nair, Chelsea Finn.
- Key Words: Reward Functions, “In-The-Wild” Human Videos
- ExpEnv: None
Offline Reinforcement Learning from Images with Latent Space Models L4DC 2021
- Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, etc.
- Key Words: Latent Space Models
- ExpEnv: DeepMind Control, Adroit Pen, Sawyer Door Open, Robel D’Claw Screw
Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? ICASSP 2022
- Vandana Rajan, Alessio Brutti, Andrea Cavallaro.
- Key Words: Multi-Modal Emotion Recognition, Cross-Attention
- ExpEnv: None

ArXiv

Multimodal Reinforcement Learning for Robots Collaborating with Humans
- Afagh Mehri Shervedani, Siyu Li, Natawut Monaikul, Bahareh Abbasi, Barbara Di Eugenio, Milos Zefran
- Key Words: robust and deliberate decisions, end-to-end training, importance enhancement, similarity, improve IRL training process in multimodal RL domains
- ExpEnv: None
See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
- Maria Attarian, Advaya Gupta, Ziyi Zhou, Wei Yu, Igor Gilitschenski, Animesh Garg
- Key Words: cognitive planning, language-guided video prediction
- ExpEnv: None
Open-vocabulary Queryable Scene Representations for Real World Planning
- Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler
- Key Words: Target Detection, Real World, Robotic Tasks
- ExpEnv: SayCan
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng
- Key Words: real world, natural language
- ExpEnv: SayCan

Contributing

We want to keep making this repo better. If you are interested in contributing, please refer to HERE for contribution instructions.

License

Awesome Multi-Modal Reinforcement Learning is released under the Apache 2.0 license.
