Corpus ID: 237353084

Deep Reinforcement Learning at the Edge of the Statistical Precipice

@inproceedings{Agarwal2021DeepRL,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Rishabh Agarwal and Max Schwarzer and Pablo Samuel Castro and Aaron C. Courville and Marc G. Bellemare},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the… 
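The aggregate metrics at issue can be made concrete with a small sketch. Below is a minimal illustration, not the paper's reference implementation (the authors released a separate library for this), of computing mean, median, and interquartile-mean (IQM) normalized scores along with a stratified-bootstrap confidence interval over a hypothetical runs-by-tasks score matrix; the array shape, score values, and bootstrap settings are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def iqm(scores):
    # Interquartile mean: mean of the middle 50% of all run-task scores.
    flat = np.sort(scores.ravel())
    n = flat.size
    return flat[n // 4 : n - n // 4].mean()

def stratified_bootstrap_ci(scores, statistic, n_boot=2000, alpha=0.05):
    # Percentile CI: resample runs with replacement, independently per task.
    n_runs, n_tasks = scores.shape
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_runs, size=(n_runs, n_tasks))
        stats.append(statistic(np.take_along_axis(scores, idx, axis=0)))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical normalized scores from 5 runs on 20 tasks.
scores = rng.normal(loc=1.0, scale=0.3, size=(5, 20))
print("mean:", scores.mean())
print("median (over per-task means):", np.median(scores.mean(axis=0)))
print("IQM:", iqm(scores), "95% CI:", stratified_bootstrap_ci(scores, iqm))

With only a handful of runs, the interval around the point estimate is typically wide; that interval is exactly the uncertainty that reporting a single mean or median score hides.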

The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning

This article augments DRL evaluations to consider parameterized families of MDPs, and shows that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art.

Mildly Conservative Q-Learning for Offline Reinforcement Learning

This paper proposes Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q-values, and theoretically shows that MCQ induces a policy that behaves at least as well as the behavior policy and that no erroneous overestimation occurs for OOD actions.

Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

A general method that uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal-difference targets, setting a new state of the art on the OpenAI Gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment.

An Empirical Study of Implicit Regularization in Deep Offline RL

It is observed that a direct association exists only in restricted settings and disappears in more extensive hyperparameter sweeps, and that bootstrapping alone is insufficient to explain the collapse of the effective rank.

The Primacy Bias in Deep Reinforcement Learning

This work proposes a simple yet generally-applicable mechanism that tackles the primacy bias of deep reinforcement learning algorithms by periodically resetting a part of the agent.
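As a rough illustration of what "periodically resetting a part of the agent" can look like in practice, the sketch below re-initializes the final layer of a small PyTorch Q-network every fixed number of gradient steps while leaving earlier layers (and, in a full agent, the replay buffer) untouched. The network architecture, the choice of which layers to reset, and the reset interval are illustrative assumptions, not the cited paper's exact configuration.

import torch.nn as nn

# Hypothetical Q-network; layer sizes are illustrative.
q_net = nn.Sequential(
    nn.Linear(4, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),  # action-value head
)

def reset_last_linear_layers(model, n_layers=1):
    # Re-initialize the last n_layers Linear modules in place.
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
    for layer in linears[-n_layers:]:
        layer.reset_parameters()

RESET_INTERVAL = 200_000  # gradient steps between resets (illustrative value)
for step in range(1, 1_000_001):
    # ... one gradient update on a replay batch would go here ...
    if step % RESET_INTERVAL == 0:
        reset_last_linear_layers(q_net, n_layers=1)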

Pretraining in Deep Reinforcement Learning: A Survey

This survey seeks to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.

SFP: State-free Priors for Exploration in Off-Policy Reinforcement Learning

This work introduces state-free priors, which directly model temporal consistency in demonstrated trajectories and can drive exploration in complex tasks even when trained on data collected from simpler tasks. It also introduces a novel integration scheme for action priors in off-policy reinforcement learning that dynamically samples actions from a probabilistic mixture of policy and action prior.

Reward Reports for Reinforcement Learning

Taking inspiration from various contributions to the technical literature on reinforcement learning, this work outlines Reward Reports as living documents that track updates to the design choices and assumptions behind what a particular automated system is optimizing for.

Generalization, Mayhems and Limits in Recurrent Proximal Policy Optimization

This work highlights vital details that one must get right when adding recurrence to achieve a correct and efficient implementation of Proximal Policy Optimization, namely: properly shaping the neural net's forward pass, arranging the training data, and correspondingly selecting hidden states for sequence beginnings and masks for loss computation.

Efficient Offline Policy Optimization with a Learned Model

This paper uses a regularized one-step look-ahead approach, employing the learned model to construct an advantage estimate based on a one-step rollout of Monte-Carlo Tree Search with a low compute budget.
...

References

SHOWING 1-10 OF 118 REFERENCES

Deep Reinforcement Learning that Matters

Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.

Q-Value Weighted Regression: Reinforcement Learning with Limited Data

This work builds upon Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, but has low sample efficiency and struggles with high-dimensional observation spaces.

A Survey on Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on Real-World Robots

A rigorous and standardised evaluation approach is presented to ease documentation, evaluation, and fair comparison of different algorithms, emphasising the importance of choosing the right measurement metrics and conducting proper statistics on the results for unbiased reporting.

SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

SUNRISE is a simple unified ensemble method, which is compatible with various off-policy RL algorithms and significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks in both low-dimensional and high-dimensional environments.

Munchausen Reinforcement Learning

It is shown that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay.

Distributional Reinforcement Learning with Quantile Regression

This paper examines methods of learning the value distribution instead of the value function in reinforcement learning, and presents a novel distributional reinforcement learning algorithm consistent with the theoretical formulation.

Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments

This work makes the case for reporting post-training agent performance as a distribution, rather than a point estimate, and demonstrates the variability of common agents used in the popular OpenAI Baselines repository.

Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field

This work introduces SABER, a Standardized Atari BEnchmark for general Reinforcement learning algorithms, uses it to evaluate the current state of the art, Rainbow, introduces a human world-records baseline, and argues that previous claims of expert or superhuman performance of DRL might not be accurate.

Reinforcement Learning with Unsupervised Auxiliary Tasks

This paper significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding

It is concluded that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ.
...