Awesome Exploration Methods in Reinforcement Learning
This is a collection of research papers on exploration methods in Reinforcement Learning (ERL). The repository will be continuously updated to track the frontier of ERL.
You are welcome to follow and star the repository!
Balancing exploration and exploitation is one of the central problems in reinforcement learning. To give readers an intuitive feeling for exploration, we provide a visualization of a typical hard-exploration environment in MiniGrid below. In this task, the sequence of actions needed to reach the goal often spans dozens or even hundreds of steps, during which the agent must thoroughly explore the state-action space to learn the skills required to reach the goal.
A typical hard-exploration environment: MiniGrid-ObstructedMaze-Full-v0.
A Taxonomy of Exploration RL Methods
In general, we can divide the reinforcement learning process into two phases: the collect phase and the train phase. In the collect phase, the agent chooses actions based on the current policy and interacts with the environment to collect useful experience. In the train phase, the agent uses the collected experience to update the current policy and obtain a better-performing policy.
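The two phases can be sketched as a simple loop. The following is a minimal illustration and is not taken from any paper in this list: a tabular Q-learning agent on a hypothetical five-state chain, where epsilon-greedy action selection supplies the exploration during collection. The environment, the function names (`step`, `collect`, `train`, `run`), and all hyperparameters are assumptions made for the example.

```python
import random

# Hypothetical toy chain: states 0..4, reward only when the right end is reached.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 20

def step(state, action):
    """Action 1 moves right, action 0 moves left; both are clipped to the chain."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def collect(q, epsilon, rng):
    """Collect phase: act with the current epsilon-greedy policy, store experience."""
    state, batch = 0, []
    for _ in range(HORIZON):
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)  # exploratory random action
        else:
            best = max(q[state])  # greedy action, ties broken at random
            action = rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])
        next_state, reward = step(state, action)
        batch.append((state, action, reward, next_state))
        state = next_state
    return batch

def train(q, batch, lr=0.5, gamma=0.9):
    """Train phase: update the Q-table (the policy) from the collected experience."""
    for state, action, reward, next_state in batch:
        target = reward + gamma * max(q[next_state])
        q[state][action] += lr * (target - q[state][action])

def run(iterations=200, epsilon=0.3, seed=0):
    """Alternate collect and train phases and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iterations):
        batch = collect(q, epsilon, rng)  # collect phase
        train(q, batch)                   # train phase
    return q
```

In this sketch exploration lives entirely in `collect`; a method that instead changes the learning signal would modify `train` or the rewards stored in `batch`.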
According to the phase in which the exploration component is explicitly applied, we divide the methods in Exploration RL into two main categories: Augmented Collecting Strategy and Augmented Training Strategy:
- Augmented Collecting Strategy represents a variety of exploration strategies commonly used in the collect phase, which we further divide into four categories:
  - Action Selection Perturbation
  - Action Selection Guidance
  - State Selection Guidance
  - Parameter Space Perturbation
- Augmented Training Strategy represents a variety of exploration strategies commonly used in the train phase, which we further divide into seven categories:
  - Count Based
  - Prediction Based
  - Information Theory Based
  - Entropy Augmented
  - Bayesian Posterior Based
  - Goal Based
  - (Expert) Demo Data
Note that these categories may overlap, and an algorithm may belong to several of them. For more detailed surveys of exploration methods in RL, see Tianpei Yang et al. and Susan Amin et al.
A non-exhaustive but useful taxonomy of methods in Exploration RL. We provide some example methods for each category, shown in the blue areas above.
Here are the links to the papers that appeared in the taxonomy:
[1] Go-Explore: Adrien Ecoffet et al., 2021
[2] NoisyNet: Meire Fortunato et al., 2018
[3] DQN-PixelCNN: Marc G. Bellemare et al., 2016
[4] #Exploration: Haoran Tang et al., 2017
[5] EX2: Justin Fu et al., 2017
[6] ICM: Deepak Pathak et al., 2018
[7] RND: Yuri Burda et al., 2018
[8] NGU: Adrià Puigdomènech Badia et al., 2020
[9] Agent57: Adrià Puigdomènech Badia et al., 2020
[10] VIME: Rein Houthooft et al., 2016
[11] EMI: Wang et al., 2019
[12] DIAYN: Benjamin Eysenbach et al., 2019
[13] SAC: Tuomas Haarnoja et al., 2018
[14] BootstrappedDQN: Ian Osband et al., 2016
[15] PSRL: Ian Osband et al., 2013
[16] HER: Marcin Andrychowicz et al., 2017
[17] DQfD: Todd Hester et al., 2018
[18] R2D3: Caglar Gulcehre et al., 2019
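As a concrete illustration of the Count Based category in the taxonomy above, here is a minimal sketch of a visit-count bonus that is added to the reward before the train phase. The bonus form `beta / sqrt(N(s))`, the class name `CountBonus`, and the helper `augment` are assumptions chosen for the example; this is a generic tabular bonus, not the exact method of any paper listed here.

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular visit counter returning an intrinsic reward beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta                 # bonus scale (illustrative value)
        self.counts = defaultdict(int)   # N(s) for every hashable state s

    def __call__(self, state):
        self.counts[state] += 1          # count this visit first, so N(s) >= 1
        return self.beta / math.sqrt(self.counts[state])

def augment(batch, bonus):
    """Add the bonus for the reached state to each (s, a, r, s_next) transition."""
    return [(s, a, r + bonus(s_next), s_next) for s, a, r, s_next in batch]
```

Rarely visited states receive a larger bonus, so the augmented reward steers learning toward novel regions. With large or continuous state spaces the exact counts would have to be replaced by an approximation, which is the direction the pseudo-count and hashing papers in this list take.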
Papers
Format:
- [title](paper link) (presentation type, openreview score [if the score is public])
  - author1, author2, author3, ...
  - Key: key problems and insights
  - ExpEnv: experiment environments
Classic Exploration RL Papers
- Using Confidence Bounds for Exploitation-Exploration Trade-offs Journal of Machine Learning Research, 2002
- Peter Auer
- Key: linear contextual bandits
- ExpEnv: None
-
A Contextual-Bandit Approach to Personalized News Article Recommendation WWW 2010
- Lihong Li, Wei Chu, John Langford, Robert E. Schapire
- Key: LinUCB
- ExpEnv: Yahoo! Front Page Today Module dataset
-
(More) Efficient Reinforcement Learning via Posterior Sampling NeurIPS 2013
- Ian Osband, Benjamin Van Roy, Daniel Russo
- Key: prior distribution, posterior sampling
- ExpEnv: RiverSwim
-
An Empirical Evaluation of Thompson Sampling NeurIPS 2011
- Olivier Chapelle, Lihong Li
- Key: Thompson sampling, empirical results
- ExpEnv: None
-
A Tutorial on Thompson Sampling arXiv 2017
- Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen
- Key: Thompson sampling
- ExpEnv: None
-
Unifying Count-Based Exploration and Intrinsic Motivation NeurIPS 2016
- Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, Remi Munos
- Key: intrinsic motivation, density models, pseudo-count
- ExpEnv: Atari
-
Deep Exploration via Bootstrapped DQN NeurIPS 2016
- Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
- Key: temporally-extended (or deep) exploration, randomized value functions, bootstrapped DQN
- ExpEnv: Atari
-
VIME: Variational information maximizing exploration NeurIPS 2016
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
- Key: maximization of information gain, belief of environment dynamics, variational inference in Bayesian neural networks
- ExpEnv: rllab
-
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning NeurIPS 2017
-
EX2: Exploration with Exemplar Models for Deep Reinforcement Learning NeurIPS 2017
-
Hindsight Experience Replay NeurIPS 2017
- Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba
- Key: hindsight experience replay, implicit curriculum
- ExpEnv: pushing, sliding, pick-and-place, physical robot
-
Curiosity-driven exploration by self-supervised prediction ICML 2017
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell
- Key: curiosity, self-supervised inverse dynamics model
- ExpEnv: VizDoom, Super Mario Bros
-
Deep Q-learning from Demonstrations AAAI 2018
- Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys
- Key: combining temporal difference updates with supervised classification of the demonstrator’s actions
- ExpEnv: Atari
-
Noisy Networks For Exploration ICLR 2018
- Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg
- Key: learned parametric noise
- ExpEnv: Atari
-
Exploration by random network distillation ICLR 2019
- Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov
- Key: random network distillation
- ExpEnv: Atari
-
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor ICML 2018
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
- Key: soft actor critic, maximum entropy, policy iteration
- ExpEnv: MuJoCo
-
Large-Scale Study of Curiosity-Driven Learning ICLR 2019
- Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, Alexei A. Efros
- Key: curiosity, prediction error, purely curiosity-driven learning, feature spaces
- ExpEnv: Atari, Super Mario Bros
-
Diversity is all you need: Learning skills without a reward function ICLR 2019
- Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, Sergey Levine
- Key: maximizing an information theoretic objective, unsupervised emergence of diverse skills
- ExpEnv: MuJoCo
-
Episodic Curiosity through Reachability ICLR 2019
-
EMI: Exploration with Mutual Information ICML 2019
-
Making Efficient Use of Demonstrations to Solve Hard Exploration Problems arXiv 2019
- Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas
- Key: R2D2, makes efficient use of demonstrations, hard exploration problems
- ExpEnv: Atari
-
Optimistic Exploration even with a Pessimistic Initialisation ICLR 2020
- Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson
- Key: pessimistically initialised Q-values, count-derived bonuses, optimism during both action selection and bootstrapping
- ExpEnv: randomised chain, Maze, Montezuma’s Revenge
-
RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments ICLR 2020
- Roberta Raileanu, Tim Rocktäschel
- Key: intrinsic reward for actions that lead to significant changes in the agent's learned state representation
- ExpEnv: MiniGrid
-
Never give up: Learning directed exploration strategies ICLR 2020
- Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell
- Key: ICM+RND, different degrees of exploration/exploitation
- ExpEnv: Atari
-
Agent57: Outperforming the Atari human benchmark ICML 2020
- Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell
- Key: parameterizes a family of policies, adaptive mechanism, state-action value function parameterization
- ExpEnv: Atari, roboschool
-
Neural Contextual Bandits with UCB-based Exploration ICML 2020
- Dongruo Zhou, Lihong Li, Quanquan Gu
- Key: stochastic contextual bandit, neural network-based random feature, near-optimal regret guarantee
- ExpEnv: contextual bandits, UCI Machine Learning Repository, MNIST
-
Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments ICLR 2021
-
First return then explore Nature 2021
- Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune
- Key: detachment and derailment, remembering states, returning to them, and exploring from them
- ExpEnv: Atari, pick-and-place robotics task
ICML 2023
- A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs
- Mikael Henaff, Minqi Jiang, Roberta Raileanu
- Key: global novelty bonuses, episodic novelty bonuses, shared structure
- ExpEnv: MiniHack suite, Habitat, Montezuma’s Revenge
- Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments
- Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko
- Key: stochastic environments, disentangle “noise” from “novelty”, BYOL-Hindsight
- ExpEnv: Pycolab Maze, Atari, Bank Heist
- Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition
- Yash Chandak, Shantanu Thakoor, Zhaohan Daniel Guo, Yunhao Tang, Remi Munos, Will Dabney, Diana Borsa
- Key: singular value decomposition, relative frequency of state visitations, scale this decomposition method to large-scale domains
- ExpEnv: DMLab-30, DM-Hard-8
- Reparameterized Policy Learning for Multimodal Trajectory Optimization
- Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su
- Key: multimodal policy parameterization, a generative model of optimal trajectories
- ExpEnv: bandit, MetaWorld, 2D maze
- Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning
- Sam Lobel, Akhil Bagaria, George Konidaris
- Key: count-based exploration, averaging samples from the Rademacher distribution (or coin flips)
- ExpEnv: Atari, D4RL, FETCH
ICLR 2023
-
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection (Oral: 10, 8, 8)
- Jiajun Fan, Yuzheng Zhuang, Yuecheng Liu, Jianye HAO, Bin Wang, Jiangcheng Zhu, Hao Wang, Shu-Tao Xia
- Key: Learnable Behavioral Control, hybrid behavior mapping, a unified learnable process for behavior selection, bandit-based metacontrollers
- ExpEnv: Atari
-
The Role of Coverage in Online Reinforcement Learning (Oral: 8, 8, 5)
- Tengyang Xie, Dylan J Foster, Yu Bai, Nan Jiang, Sham M. Kakade
- Key: coverage conditions, data logging distribution, sample-efficient exploration, sequential extrapolation coefficient
- ExpEnv: None
-
Near-optimal Policy Identification in Active Reinforcement Learning (Oral: 8, 8, 8)
- Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, Ilija Bogunovic
- Key: kernelized least-squares value iteration, combines optimism with pessimism for active exploration
- ExpEnv: Cartpole, Navigation, Tracking, Rotation, Branin-Hoo, Hartmann
-
Planning Goals for Exploration (Spotlight: 8, 8, 8, 8, 6)
- Edward S. Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman
- Key: goal-conditioned, planning exploratory goals, world models, sampling-based planning algorithms
- ExpEnv: Point Maze, Walker, Ant Maze, 3-Block Stacking
-
Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning (Spotlight: 8, 8, 8)
- Onno Eberhard, Jakob Hollenstein, Cristina Pinneri, Georg Martius
- Key: continuous action spaces, temporally correlated noise, colored noise
- ExpEnv: DeepMind Control Suite, Atari, Adroit hand suite
-
Learning About Progress From Experts (Spotlight: 8, 8, 6)
- Jake Bruce, Ankit Anand, Bogdan Mazoure, Rob Fergus
- Key: the use of expert demonstrations, long-horizon tasks, learn a monotonically increasing function that summarizes progress
- ExpEnv: NetHack
-
DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems (Spotlight: 10, 8, 8, 8)
- Pierre Schumacher, Daniel Haeufle, Dieter Büchler, Syn Schmitt, Georg Martius
- Key: large overactuated action spaces, differential extrinsic plasticity, state-space covering exploration
- ExpEnv: musculoskeletal systems: torquearm, arm26, humanreacher, ostrich-foraging, ostrich-run, human-run, human-hop
-
Does Zero-Shot Reinforcement Learning Exist? (Spotlight: 10, 8, 8, 3)
- Ahmed Touati, Jérémy Rapin, Yann Ollivier
- Key: zero-shot RL agent, disentangle universal representation learning from exploration, SFs with Laplacian eigenfunctions
- ExpEnv: Unsupervised RL and ExORL benchmarks
-
Human-level Atari 200x faster (Poster: 8, 8, 3)
- Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakicevic, Hado van Hasselt, Charles Blundell, Adria Puigdomenech Badia
- Key: 200-fold reduction of experience, a more robust and efficient agent
- ExpEnv: Atari 57
-
Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward (Poster: 8, 8, 5, 5)
- Zihan Zhou, Animesh Garg
- Key: achievement-based environments, recovered dependency graph
- ExpEnv: Crafter, TreeMaze
-
Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-Free RL (Poster: 8, 8, 6, 6)
- Ruiquan Huang, Jing Yang, Yingbin Liang
- Key: reward-free reinforcement learning, reduce the uncertainty in the estimated model with a minimum number of trajectories
- ExpEnv: tabular MDPs, Low-rank MDP
-
Latent State Marginalization as a Low-cost Approach to Improving Exploration (Poster: 6, 6, 6)
- Dinghuai Zhang, Aaron Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, Ricky T. Q. Chen
- Key: adoption of latent variable policies within the MaxEnt framework, low-cost marginalization of the latent state
- ExpEnv: DeepMind Control Suite
-
Revisiting Curiosity for Exploration in Procedurally Generated Environments (Poster: 8, 8, 5, 3, 3)
- Kaixin Wang, Kuangqi Zhou, Bingyi Kang, Jiashi Feng, Shuicheng YAN
- Key: lifelong intrinsic rewards and episodic intrinsic rewards, the performance of all lifelong-episodic combinations
- ExpEnv: MiniGrid
-
MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations (Poster: 8, 6, 6, 6)
- Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran
- Key: key ingredients for leveraging demonstrations in model learning
- ExpEnv: Adroit, Meta-World, DeepMind Control Suite
-
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (Poster: 8, 6, 6, 6, 6)
- Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Russ Salakhutdinov
- Key: alignment between these auxiliary objectives and the RL objective, a lower bound on expected returns
- ExpEnv: model-based benchmark
-
EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model (Poster: 6, 6, 6, 6)
- Yifu Yuan, Jianye HAO, Fei Ni, Yao Mu, YAN ZHENG, Yujing Hu, Jinyi Liu, Yingfeng Chen, Changjie Fan
- Key: transition dynamics modeling, multi-choice dynamics model, sampling efficiency
- ExpEnv: URLB
-
Guarded Policy Optimization with Imperfect Online Demonstrations (Oral: 8, 8, 6, 5)
- Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
- Key: teacher-student shared control, safety guarantee and exploration guidance, trajectory-based value estimation
- ExpEnv: MetaDrive
NeurIPS 2022
-
Redeeming Intrinsic Rewards via Constrained Optimization (Poster: 8, 7, 7)
- Eric Chen, Zhang-Wei Hong, Joni Pajarinen, Pulkit Agrawal
- Key: automatically tunes the importance of the intrinsic reward, principled constrained policy optimization procedure
- ExpEnv: Atari
-
You Only Live Once: Single-Life Reinforcement Learning via Learned Reward Shaping (Poster: 6, 6, 5, 5)
- Annie S. Chen, Archit Sharma, Sergey Levine, Chelsea Finn
- Key: single-life reinforcement learning, Q-weighted adversarial learning (QWALE), distribution matching strategy
- ExpEnv: Tabletop-Organization, Pointmass, modified HalfCheetah, modified Franka-Kitchen
-
Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation (Poster: 8, 7, 6)
- Cansu Sancaktar, Sebastian Blaes, Georg Martius
- Key: self-reinforcing cycle between good models and good exploration, zero-shot generalization to downstream tasks via model-based planning
- ExpEnv: Playground, Fetch Pick & Place Construction
-
Model-based Lifelong Reinforcement Learning with Bayesian Exploration (Poster: 7, 6, 6)
- Haotian Fu, Shangqun Yu, Michael Littman, George Konidaris
- Key: hierarchical Bayesian posterior
- ExpEnv: HiP-MDP versions of Mujoco, Meta-world
-
On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL (Poster: 7, 6, 5, 5)
- Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal
- Key: sample-efficient reward-free exploration, explorability or reachability assumptions
- ExpEnv: None
-
DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning (Poster: 8, 7, 4)
- Archana Bura, Aria Hasanzadezonuzy, Dileep Kalathil, Srinivas Shakkottai, Jean-Francois Chamberland
- Key: model-based safe RL, finite-horizon Constrained Markov Decision Process, reward bonus for exploration (optimism) with a conservative constraint (pessimism)
- ExpEnv: Factored CMDP environment
-
Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning
- Chenyang Wu, Tianci Li, Zongzhang Zhang, Yang Yu
- Key: Optimism in the face of uncertainty (OFU), Bayesian optimistic optimization
- ExpEnv: RiverSwim, Chain, Random MDPs
-
Active Exploration for Inverse Reinforcement Learning (Poster: 7, 7, 7, 7)
- David Lindner, Andreas Krause, Giorgia Ramponi
- Key: actively explores an unknown environment and expert policy, does not require a generative model of the environment
- ExpEnv: Four Paths, Random MDPs, Double Chain, Chain, Gridworld
-
Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards (Poster: 6, 6, 4)
- Rati Devidze, Parameswaran Kamalaruban, Adish Singla
- Key: reward shaping, intrinsic reward function, exploration-based bonuses
- ExpEnv: Chain, Room, LineK
-
Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations (Poster: 6, 6, 5, 5)
- Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane, Daniel S. Brown, Ken Goldberg
- Key: parameter free, the maximum of the standard TD target and a Monte Carlo estimate of the reward-to-go
- ExpEnv: Pointmass Navigation, Block Extraction, Sequential Pushing, Door Opening, Block Lifting
-
Incentivizing Combinatorial Bandit Exploration (Poster: 7, 6, 5, 3)
- Xinyan Hu, Dung Daniel Ngo, Aleksandrs Slivkins, Zhiwei Steven Wu
- Key: incentivized exploration, large, structured action sets and highly correlated beliefs, combinatorial semi-bandits
- ExpEnv: None
ICML 2022
-
From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses (Oral)
- Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard
- Key: Bayes-UCBVI, regret bound, quantile of a Q-value function posterior, anticoncentration inequality for a Dirichlet weighted sum
- ExpEnv: simple tabular grid-world env, Atari
-
The Importance of Non-Markovianity in Maximum State Entropy Exploration (Oral)
- Mirco Mutti, Riccardo De Santi, Marcello Restelli
- Key: maximum state entropy exploration, non-Markovianity, finite-sample regime
- ExpEnv: 3State, River Swim
-
Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning (Spotlight)
- Yunfei Li, Tian Gao, Jiaqi Yang, Huazhe Xu, Yi Wu
- Key: sparse-reward goal-conditioned, RL/SL phasic, task reduction
- ExpEnv: Sawyer Push, Ant Maze, Stacking
-
Thompson Sampling for (Combinatorial) Pure Exploration (Spotlight)
- Siwei Wang, Jun Zhu
- Key: combinatorial pure exploration, Thompson Sampling, lower complexity
- ExpEnv: combinatorial multi-armed bandit
-
Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path (Spotlight)
- Haoyuan Cai, Tengyu Ma, Simon Du
- Key: incremental autonomous exploration, stronger sample complexity bounds, multi-goal stochastic shortest path
- ExpEnv: hard MDP
-
Safe Exploration for Efficient Policy Evaluation and Comparison (Spotlight)
- Runzhe Wan, Branislav Kveton, Rui Song
- Key: efficient and safe data collection for bandit policy evaluation
- ExpEnv: multi-armed bandit, contextual multi-armed bandit, linear bandits
ICLR 2022
-
The Information Geometry of Unsupervised Reinforcement Learning (Oral: 8, 8, 8)
- Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
- Key: unsupervised skill discovery, mutual information objective, adversarially-chosen reward functions
- ExpEnv: None
-
When should agents explore? (Spotlight: 8, 8, 6, 6)
- Miruna Pislar, David Szepesvari, Georg Ostrovski, Diana Borsa, Tom Schaul
- Key: mode-switching, non-monolithic exploration, intra-episodic exploration
- ExpEnv: Atari
-
Learning more skills through optimistic exploration (Spotlight: 8, 8, 8, 6)
- DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, Steven Hansen
- Key: discriminator disagreement intrinsic reward, information gain auxiliary objective
- ExpEnv: tabular grid world, Atari
-
Learning Long-Term Reward Redistribution via Randomized Return Decomposition (Spotlight: 8, 8, 8, 5)
- Zhizhou Ren, Ruihan Guo, Yuan Zhou, Jian Peng
- Key: sparse and delayed rewards, randomized return decomposition
- ExpEnv: MuJoCo
-
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration (Spotlight: 8, 8, 8, 6, 6)
-
Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning (Spotlight: 8, 8, 8, 6)
- Haichao Zhang, Wei Xu, Haonan Yu
- Key: generative planning method, temporally coordinated exploration, crude initial plan
- ExpEnv: classic continuous control env, CARLA
-
Learning Altruistic Behaviours in Reinforcement Learning without External Rewards (Spotlight: 8, 8, 6, 6)
- Tim Franzmeyer, Mateusz Malinowski, João F. Henriques
- Key: altruistic behaviour, task-agnostic
- ExpEnv: grid world env, foraging, multi-agent tag
-
Anti-Concentrated Confidence Bonuses for Scalable Exploration (Poster: 8, 6, 5)
- Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham Kakade
- Key: anti-concentrated confidence bounds, elliptical bonus
- ExpEnv: multi-armed bandit, Atari
-
Lipschitz-constrained Unsupervised Skill Discovery (Poster: 8, 6, 6, 6)
- Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, Gunhee Kim
- Key: unsupervised skill discovery, Lipschitz-constrained
- ExpEnv: MuJoCo
-
LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning (Poster: 8, 6, 5, 5)
- David Henry Mguni, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Oliver Slumbers, Feifei Tong, Yang Li, Jiangcheng Zhu, Yaodong Yang, Jun Wang
- Key: multi-agent, coordinated exploration and behaviour, learnable intrinsic-reward generation selection, switching controls
- ExpEnv: foraging, StarCraft II
-
Multi-Stage Episodic Control for Strategic Exploration in Text Games (Spotlight: 8, 8, 6, 6)
- Jens Tuyls, Shunyu Yao, Sham M. Kakade, Karthik R Narasimhan
- Key: multi-stage approach, policy decomposition
- ExpEnv: Jericho
-
On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning (Poster: 8, 8, 5, 5)
- Che Wang, Shuhan Yuan, Kai Shao, Keith Ross
- Key: Monte Carlo exploring starts, optimal policy feed-forward MDPs
- ExpEnv: Blackjack, Cliff Walking
NeurIPS 2021
-
Interesting Object, Curious Agent: Learning Task-Agnostic Exploration (Oral: 9, 8, 8, 8)
-
Tactical Optimism and Pessimism for Deep Reinforcement Learning (Poster: 9, 7, 6, 6)
- Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel, Michael Jordan
- Key: Tactical Optimistic and Pessimistic estimation, multi-arm bandit problem
- ExpEnv: MuJoCo
-
Which Mutual-Information Representation Learning Objectives are Sufficient for Control? (Poster: 7, 6, 6, 5)
- Kate Rakelly, Abhishek Gupta, Carlos Florensa, Sergey Levine
- Key: mutual information objectives, sufficiency of a state representation
- ExpEnv: catcher, catcher-grip
-
On the Theory of Reinforcement Learning with Once-per-Episode Feedback (Poster: 6, 5, 5, 4)
- Niladri S. Chatterji, Aldo Pacchiano, Peter L. Bartlett, Michael I. Jordan
- Key: binary feedback, sublinear regret
- ExpEnv: None
-
MADE: Exploration via Maximizing Deviation from Explored Regions (Poster: 7, 7, 6, 5)
- Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph Gonzalez, Stuart Russell
- Key: maximizing deviation from the explored regions, intrinsic reward
- ExpEnv: MiniGrid, DeepMind Control Suite
-
Adversarial Intrinsic Motivation for Reinforcement Learning (Poster: 7, 7, 6)
- Ishan Durugkar, Mauricio Tec, Scott Niekum, Peter Stone
- Key: the Wasserstein-1 distance, goal-conditioned, quasimetric, adversarial intrinsic motivation
- ExpEnv: Grid World, Fetch Robot (based on MuJoCo)
-
Information Directed Reward Learning for Reinforcement Learning (Poster: 9, 8, 7, 6)
- David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
- Key: expert queries, Bayesian model of the reward, maximize the information gain
- ExpEnv: MuJoCo
-
Dynamic Bottleneck for Robust Self-Supervised Exploration (Poster: 8, 6, 6, 6)
- Chenjia Bai, Lingxiao Wang, Lei Han, Animesh Garg, Jianye Hao, Peng Liu, Zhaoran Wang
- Key: Dynamic Bottleneck, information gain
- ExpEnv: Atari
-
Hierarchical Skills for Efficient Exploration (Poster: 7, 6, 6, 6)
- Jonas Gehring, Gabriel Synnaeve, Andreas Krause, Nicolas Usunier
- Key: hierarchical skill learning, balance between generality and specificity, skills of varying complexity
- ExpEnv: Hurdles, Limbo, Stairs, GoalWall, PoleBalance (based on MuJoCo)
-
Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality (Spotlight: 8, 6, 6)
- Stefanos Leonardos, Georgios Piliouras, Kelly Spendlove
- Key: competitive multi-agent, balance between game rewards and exploration costs, unique quantal-response equilibrium
- ExpEnv: Two-Agent Weighted Zero-Sum Games
-
NovelD: A Simple yet Effective Exploration Criterion (Poster: 7, 6, 6, 6)
-
Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration (Poster: 7, 6, 6, 5)
- Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, Chongjie Zhang
- Key: episodic Multi-agent, curiosity-driven exploration, prediction errors, episodic memory
- ExpEnv: Predator-Prey, StarCraft II
-
Learning Diverse Policies in MOBA Games via Macro-Goals (Poster: 7, 6, 5, 5)
- Yiming Gao, Bei Shi, Xueying Du, Liang Wang, Guangwei Chen, Zhenjie Lian, Fuhao Qiu, Guonan Han, Weixuan Wang, Deheng Ye, Qiang Fu, Wei Yang, Lanxiao Huang
- Key: MOBA-game, policy diversity, Macro-Goals Guided framework, Meta-Controller, human demonstrations
- ExpEnv: Honor of Kings
-
CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery (not accepted now: 8, 8, 6, 3)
- Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, Pieter Abbeel
- Key: decomposition of the mutual information, particle estimator, contrastive learning
- ExpEnv: URLB
Contributing
Our goal is to provide a starting reading guide for anyone interested in exploration methods in RL. If you would like to contribute, please refer to HERE for contribution instructions.
License
Awesome Exploration RL is released under the Apache 2.0 license.