• Corpus ID: 237353084

# Deep Reinforcement Learning at the Edge of the Statistical Precipice

@inproceedings{Agarwal2021DeepRL,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Rishabh Agarwal and Max Schwarzer and Pablo Samuel Castro and Aaron C. Courville and Marc G. Bellemare},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
• Published in Neural Information Processing Systems, 30 August 2021 • Computer Science
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the…
158 Citations
• Computer Science • ArXiv • 2022
This article augments DRL evaluations to consider parameterized families of MDPs, and shows that in comparison to evaluating DRL methods on select MDP instances, evaluating the MDP family often yields a substantially different relative ranking of methods, casting doubt on what methods should be considered state-of-the-art.
• Computer Science • ArXiv • 2022
This paper proposes Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q-values, and theoretically shows that MCQ induces a policy that behaves at least as well as the behavior policy and that no erroneous overestimation occurs for OOD actions.
• Computer Science • ArXiv • 2022
It is observed that a direct association exists only in restricted settings and disappears in the more extensive hyperparameter sweeps, and found that bootstrapping alone is insufficient to explain the collapse of the effective rank.
• Computer Science • IEEE Robotics and Automation Letters • 2023
A general method called Adaptively Calibrated Critics (ACC) is proposed that uses the most recent high variance but unbiased on-policy rollouts to alleviate the bias of the low variance temporal difference targets.
• Computer Science • ArXiv • 2022
A sequential approach is proposed to evaluate offline RL algorithms as a function of the training set size, and thus by their data efficiency, which provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset.
• Computer Science • ICML • 2022
This work proposes a simple yet generally-applicable mechanism that tackles the primacy bias of deep reinforcement learning algorithms by periodically resetting a part of the agent.
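The periodic-reset mechanism described above might look like the following minimal sketch, assuming the agent's weights are stored as plain arrays; the layer names, reset interval, and initializer here are hypothetical stand-ins, not taken from the cited work:

```python
import numpy as np

RESET_INTERVAL = 200_000  # hypothetical; chosen per environment in practice

def init_layer(rng, shape):
    """Fresh random weights for one layer."""
    return rng.standard_normal(shape) * 0.1

def maybe_reset(agent, step, rng):
    """Every RESET_INTERVAL steps, re-initialize only the designated
    layers (e.g. the head), keeping the rest of the network — and,
    in a full agent, the replay buffer — untouched."""
    if step > 0 and step % RESET_INTERVAL == 0:
        for name in agent["reset_layers"]:
            agent["weights"][name] = init_layer(rng, agent["weights"][name].shape)
    return agent

rng = np.random.default_rng(0)
agent = {
    "weights": {"torso": init_layer(rng, (64, 64)),
                "head": init_layer(rng, (64, 4))},
    "reset_layers": ["head"],  # reset only the final layer
}
head_before = agent["weights"]["head"].copy()
torso_before = agent["weights"]["torso"].copy()
agent = maybe_reset(agent, step=RESET_INTERVAL, rng=rng)
```

The key design point is that only a designated subset of parameters is re-initialized, so knowledge in the retained layers and the replay data survives each reset.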
• Computer Science • ArXiv • 2022
This survey seeks to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.
• Computer Science • 2022
This work introduces state-free priors, which directly model temporal consistency in demonstrated trajectories and are capable of driving exploration in complex tasks even when trained on data collected on simpler tasks, and introduces a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior.
• Computer Science • 2022
As deep RL research moves towards more complex and challenging benchmarks, the computational barrier to entry in RL research will be substantially higher, due to the inefficiency of tabula rasa RL.
• Computer Science • ArXiv • 2022
Taking inspiration from various contributions to the technical literature on reinforcement learning, this work outlines Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for.

## References

SHOWING 1-10 OF 118 REFERENCES

• Computer Science • AAAI • 2018
Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.
• Computer Science • 2022 International Joint Conference on Neural Networks (IJCNN)
This work builds upon Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, but has low sample efficiency and struggles with high-dimensional observation spaces.
• Computer Science • AAAI • 2018
This paper examines methods of learning the value distribution instead of the value function in reinforcement learning, and presents a novel distributional reinforcement learning algorithm consistent with the theoretical formulation.
• Computer Science • CoRL • 2019
A rigorous and standardised evaluation approach is presented to ease the documentation, evaluation, and fair comparison of different algorithms, emphasising the importance of choosing the right measurement metrics and conducting proper statistics on the results for unbiased reporting.
• Computer Science • ICML • 2021
SUNRISE is a simple unified ensemble method that is compatible with various off-policy RL algorithms and significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks in both low-dimensional and high-dimensional environments.
• Computer Science • NeurIPS • 2020
It is shown that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay.
• Computer Science • ArXiv • 2019
This work makes the case for reporting post-training agent performance as a distribution, rather than a point estimate, and demonstrates the variability of common agents used in the popular OpenAI Baselines repository.
• Computer Science • ICLR • 2017
This paper significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
It is concluded that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ.
• Computer Science • AAMAS • 2021
It is shown that learning an adequately diverse set of policies is required for a good ensemble, while extreme diversity can prove detrimental to overall performance, and this framework is seen to outperform state-of-the-art (SOTA) scores in Atari 2600 and MuJoCo.