First return then explore

@article{Ecoffet2021FirstRT,
  title={First return then explore},
  author={Adrien Ecoffet and Joost Huizinga and Joel Lehman and Kenneth O. Stanley and Jeff Clune},
  journal={Nature},
  year={2021},
  volume={590},
  number={7847},
  pages={580-586}
}
Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse [1] and deceptive [2] feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main…
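For readers skimming this page, the titular procedure can be summarized as an archive-based loop: remember interesting states, first return to one of them, then explore from it. Below is a minimal, illustrative Python sketch of that loop; the resettable simulator with `get_snapshot`/`restore` methods, the `cell_of` abstraction, and all other names are assumptions of this sketch, not the paper's released code.

```python
import random

def cell_of(obs):
    """Hypothetical state abstraction: map an observation to a coarse,
    hashable 'cell' (the paper downsamples game frames for this purpose)."""
    return tuple(round(float(x), 1) for x in obs)

def go_explore(env, iterations=1000, explore_steps=100):
    """Minimal archive-based loop in the spirit of 'first return, then
    explore': remember the best way of reaching each discovered cell,
    return to a stored cell by restoring the simulator, then explore."""
    obs = env.reset()
    # archive: cell -> (simulator snapshot, score accumulated to reach it)
    archive = {cell_of(obs): (env.get_snapshot(), 0.0)}  # get_snapshot: assumed API

    for _ in range(iterations):
        # 1) Select a previously visited cell (uniform here; the paper
        #    weights selection toward promising, rarely chosen cells).
        cell = random.choice(list(archive))
        snapshot, score = archive[cell]

        # 2) First return: restore the simulator to that cell exactly.
        env.restore(snapshot)  # restore: assumed API

        # 3) Then explore from it, recording any newly reached cells.
        for _ in range(explore_steps):
            obs, reward, done, _ = env.step(env.action_space.sample())
            score += reward
            c = cell_of(obs)
            # Keep only the highest-scoring trajectory into each cell.
            if c not in archive or score > archive[c][1]:
                archive[c] = (env.get_snapshot(), score)
            if done:
                break
    return archive
```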
Divide & Conquer Imitation Learning
TLDR
This paper presents a novel algorithm designed to imitate complex robotic tasks from the states of an expert trajectory, based on a sequential inductive bias, and shows that it imitates a non-holonomic navigation task and scales to a complex simulated robotic manipulation task with very high sample efficiency.
GAN-based Intrinsic Exploration for Sample Efficient Reinforcement Learning
TLDR
A Generative Adversarial Network-based Intrinsic Reward Module is proposed that learns the distribution of observed states and emits an intrinsic reward that is high for out-of-distribution states, in order to lead the agent to unexplored states.
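As a rough illustration of the mechanism summarized above (a learned model of the visited-state distribution whose judgment of unfamiliarity becomes the bonus), here is a hedged PyTorch sketch; the network sizes, the sigmoid output, and the `1 - score` reward mapping are assumptions of this sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VisitedStateDiscriminator(nn.Module):
    """Toy discriminator over flattened observations: estimates the
    probability that a state comes from the distribution of states
    already visited by the agent."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs):
        return self.net(obs)

def intrinsic_reward(discriminator, obs):
    """States judged unfamiliar (low 'seen' probability) receive a high
    bonus, steering the agent toward out-of-distribution states."""
    with torch.no_grad():
        p_seen = discriminator(obs)      # shape: (batch, 1)
    return (1.0 - p_seen).squeeze(-1)    # shape: (batch,)
```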
Go-Blend Behavior and Affect
TLDR
The proposed framework introduces a paradigm shift for affect modeling by viewing it as a reinforcement learning process, and empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.
Procedural Content Generation: Better Benchmarks for Transfer Reinforcement Learning
TLDR
It is noted that another development, the increase in procedural content generation (PCG), can improve both benchmarking and generalization in TRL, and that Alchemy and Meta-World are emerging as interesting benchmark suites.
BeBold: Exploration Beyond the Boundary of Explored Regions
TLDR
The regulated difference of inverse visitation counts is proposed as a simple but effective criterion for intrinsic reward (IR) that helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment.
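The criterion summarized above can be illustrated with a tiny count-based sketch: a difference of inverse visitation counts across a transition, clipped at zero. Treating clipping as the "regulation", omitting the paper's episodic handling, and assuming hashable (e.g., discretized) states are simplifications of this illustration, not claims about the method.

```python
from collections import defaultdict

class InverseCountDifferenceBonus:
    """Intrinsic reward built from the difference of inverse visitation
    counts across a transition, clipped at zero so that only moves into
    less-visited states are rewarded."""
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, state, next_state):
        # Update visitation counts for both endpoints of the transition.
        self.counts[state] += 1
        self.counts[next_state] += 1
        # Positive only when the successor has been visited less often.
        bonus = 1.0 / self.counts[next_state] - 1.0 / self.counts[state]
        return max(bonus, 0.0)
```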
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
TLDR
A novel model-based approach is introduced that interleaves discovering new states from s_0 with improving the accuracy of a model estimate used to compute goal-conditioned policies; it is the first algorithm that can return an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c_min.
A Unifying Framework for Reinforcement Learning and Planning
TLDR
A unifying algorithmic framework for reinforcement learning and planning (FRAP) is presented, which identifies the underlying dimensions along which MDP planning and learning algorithms must make decisions, and compares a variety of well-known planning, model-free and model-based RL algorithms along these dimensions.
Learning Design and Construction with Varying-Sized Materials via Prioritized Memory Resets
TLDR
This paper develops a novel technique, prioritized memory resetting (PMR), which adaptively resets the state to the most critical configurations from a replay buffer so that the robot can resume training on partial architectures instead of from scratch.
Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey
TLDR
A typology of methods at the intersection of deep RL and developmental approaches is proposed, in which deep RL algorithms are trained to tackle the developmental robotics problem of autonomously acquiring open-ended repertoires of skills.
BYOL-Explore: Exploration by Bootstrapped Prediction
TLDR
It is shown that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments and achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
...

References

SHOWING 1-10 OF 64 REFERENCES
ON BONUS-BASED EXPLORATION METHODS
TLDR
The results suggest that recent gains in MONTEZUMA’S REVENGE may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
MIME: Mutual Information Minimisation Exploration
TLDR
This work proposes Mutual Information Minimising Exploration (MIME), a counter-intuitive solution for reinforcement learning agents that get stuck at abrupt environmental transition boundaries, in which an agent learns a latent representation of the environment without trying to predict future states.
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
TLDR
The MuZero algorithm is presented, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.
Grandmaster level in StarCraft II using multi-agent reinforcement learning
TLDR
The agent, AlphaStar, is evaluated, which uses a multi-agent reinforcement learning algorithm and has reached Grandmaster level, ranking among the top 0.2% of human players for the real-time strategy game StarCraft II.
Combining Experience Replay with Exploration by Random Network Distillation
TLDR
This work shows how to efficiently combine Intrinsic Rewards with Experience Replay in order to achieve more efficient and robust exploration (with respect to PPO/RND) and, consequently, better results in terms of agent performance and sample efficiency.
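For context on the summary above, Random Network Distillation (RND) derives its intrinsic reward from the error of a trained predictor against a fixed, randomly initialized target network. The sketch below shows only that bonus computation, not the experience-replay integration this work proposes; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: a predictor is trained to match a
    frozen, randomly initialized target network; its per-state error is
    large on rarely seen states and is used as the intrinsic reward."""
    def __init__(self, obs_dim, feat_dim=64, hidden=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network is never trained

    def forward(self, obs):
        # Per-state squared prediction error: doubles as the intrinsic
        # reward and (averaged over a batch) as the predictor's loss.
        return ((self.predictor(obs) - self.target(obs).detach()) ** 2).mean(dim=-1)
```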
Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains
TLDR
This work describes a new, autonomous approach for deriving subgoals from raw pixels that is more efficient than competing methods, and proposes a novel intrinsic reward scheme for exploiting the derived subgoals, applying it to three Atari games with sparse rewards.
Go-Explore: a New Approach for Hard-Exploration Problems
TLDR
A new algorithm called Go-Explore is presented, which exploits the following principles: remember previously visited states, solve simulated environments through any available means, and robustify via imitation learning, resulting in a dramatic performance improvement on hard-exploration problems.
Novelty Search and the Problem with Objectives
By synthesizing a growing body of work in search processes that are not driven by explicit objectives, this paper advances the hypothesis that there is a fundamental problem with the dominant paradigm of objective-driven search.
Solving Montezuma's Revenge with Planning and Reinforcement Learning
TLDR
This work applies planning and reinforcement learning approaches, combined with domain knowledge, to enable an agent to obtain better scores in Montezuma's Revenge, and hopes that these domain-specific algorithms can inspire better approaches to solve SDPs with sparse feedback in general.
Contingency-Aware Exploration in Reinforcement Learning
TLDR
This study develops an attentive dynamics model (ADM) that discovers controllable elements of the observations, which are often associated with the location of the character in Atari games, confirming that contingency-awareness is indeed an extremely powerful concept for tackling exploration problems in reinforcement learning.
...