
A Hitchhiker's Guide to Statistical Comparisons of Reinforcement Learning Algorithms

Cédric Colas, Olivier Sigaud, Pierre-Yves Oudeyer
Consistently checking the statistical significance of experimental results is the first mandatory step towards reproducible science. [...] Key result: we conclude by providing guidelines and code to perform rigorous comparisons of RL algorithm performance.
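As a rough illustration of what checking statistical significance between two RL algorithms can look like, the sketch below runs a two-sided permutation test on the difference in mean final return. This is a generic test chosen for illustration and not necessarily the procedure the paper recommends; the function name `permutation_test` and the return values are hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean performance.

    `a` and `b` hold one performance measure per independent run (seed).
    Returns a Monte Carlo estimate of the p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign runs to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical final returns, one value per random seed (illustrative only).
algo_1 = [3120.0, 2890.0, 3305.0, 3010.0, 3250.0]
algo_2 = [2450.0, 2610.0, 2380.0, 2700.0, 2520.0]

p_value = permutation_test(algo_1, algo_2)
same_p = permutation_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if both algorithms had the same performance distribution. With only a handful of seeds the achievable resolution is coarse, so the number of runs matters as much as the choice of test.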

Deep Reinforcement Learning at the Edge of the Statistical Precipice

This paper argues that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. It advocates reporting interval estimates of aggregate performance and proposes performance profiles to account for the variability in results.
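One common way to obtain an interval estimate from a few runs is a percentile bootstrap. The sketch below is a minimal version for the mean score; it is an assumption-laden illustration, not the paper's own procedure or aggregate metric, and the function name `bootstrap_ci` and the scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score.

    `scores` holds one aggregate score per independent run.
    Returns (lower, upper) bounds of a (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample runs with replacement and recompute the mean each time.
    estimates = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical normalized scores from five runs (illustrative only).
scores = [0.62, 0.71, 0.55, 0.80, 0.67]
lower, upper = bootstrap_ci(scores)
print(f"mean = {mean(scores):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the point estimate makes it visible when two algorithms' results overlap too much to support a ranking.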


This paper proposes a set of general-purpose metrics that quantitatively measure different aspects of reliability, both during training and after learning (on a fixed policy).

Demystifying Reproducibility in Meta- and Multi-Task Reinforcement Learning

This work analyzes several design decisions each author must make when they implement a meta-RL or MTRL algorithm, and uses over 500 experiments to show that these seemingly-small details can create statistically-significant variations in a single algorithm’s performance that exceed the reported performance differences between algorithms themselves.

Detecting Rewards Deterioration in Episodic Reinforcement Learning

This paper considers an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov, and presents this problem as a multivariate mean-shift detection problem with possibly partial observations, and derives a test for this problem with optimal statistical power.

Causal Based Q-Learning

This work tackles the classic taxi problem and shows that using causal models in the Q-learning action-selection step leads to a higher jump-start reward and faster convergence.

Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification

This paper presents a semi-formal verification approach for decision-making tasks, based on interval analysis, that addresses the computational demands of previous verification frameworks, and designs metrics to measure the safety of the models.

An Evaluation of a Bootstrapped Neuro-Evolution Approach

The findings of this study indicate that BNE can be used to successfully evolve closed-loop control policies and train simulators that can be used to evaluate a partial fitness function of a policy for a POMDP.

Is High Variance Unavoidable in RL? A Case Study in Continuous Control

This paper argues that developing low-variance agents is an important goal for the RL community and that it can be pursued via simple modifications; one cause of outlier runs is unstable network parametrization, which leads to saturating nonlinearities.

Exploring Safer Behaviors for Deep Reinforcement Learning

A Safety-Oriented Search is proposed that complements Deep RL algorithms to bias the policy toward safety within an evolutionary cost optimization and leverages evolutionary exploration benefits to design a novel concept of safe mutations that use visited unsafe states to explore safer actions.

How to Make Deep RL Work in Practice

This paper investigates the influence of certain initialization, input normalization, and adaptive learning techniques on the performance of state-of-the-art RL algorithms, suggests which of those techniques to use by default, and highlights areas that could benefit from a solution specifically tailored to RL.



Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

The significance of hyper-parameters in policy gradients for continuous control, the general variance of the algorithms, and the reproducibility of reported results are investigated, and guidelines on reporting novel results as comparisons against baseline methods are provided.

Deep Reinforcement Learning that Matters

Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.

Addressing Function Approximation Error in Actor-Critic Methods

This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
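The idea of taking the minimum over a pair of critics can be sketched as a scalar target computation. This is a simplified illustration under stated assumptions: the function name `clipped_double_q_target` and its arguments are hypothetical, the two next-state values are assumed to come from target critics evaluated at the target policy's action, and other components of the full method are omitted.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped target using the smaller of two critic estimates.

    reward:  immediate reward for the transition
    done:    True if the transition ended the episode
    q1_next: first critic's value estimate for the next state-action pair
    q2_next: second critic's value estimate for the same pair
    gamma:   discount factor
    """
    # Using the minimum of the two estimates counteracts overestimation.
    min_q = min(q1_next, q2_next)
    # No bootstrapping past a terminal state.
    return reward + gamma * (1.0 - float(done)) * min_q


# Illustrative values only.
target = clipped_double_q_target(1.0, False, 10.0, 8.0, gamma=0.9)
terminal_target = clipped_double_q_target(1.0, True, 10.0, 8.0, gamma=0.9)
print(target, terminal_target)
```

Both critics would then be regressed towards the same target, so a single optimistic critic cannot inflate the value estimate on its own.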

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Using the Student's t-test with extremely small sample sizes

Researchers occasionally have to work with an extremely small sample size, defined herein as N ≤ 5. Some methodologists have cautioned against using the t-test when the sample size is extremely

An Introduction to the Bootstrap


Rank Transformations as a Bridge between Parametric and Nonparametric Statistics

Abstract Many of the more useful and powerful nonparametric procedures may be presented in a unified manner by treating them as rank transformation procedures. Rank transformation procedures are ones

OpenAI Gym

This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.


Any experiment may be regarded as forming an individual of a “population” of experiments which might be performed under the same conditions. A series of experiments is a sample drawn from this