Corpus ID: 246063814

Two-Sample Testing in Reinforcement Learning

Martin Waltz and Ostap Okhrin
Value-based reinforcement-learning algorithms have shown strong performance in games, robotics, and other real-world applications. The most popular sample-based method is Q-Learning. A Q-value is the expected return for a state-action pair when following a particular policy, and the algorithm performs updates by adjusting the current Q-value towards the observed reward and the maximum of the Q-values of the next state. The procedure introduces maximization bias, and solutions like…
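The update described in the abstract can be sketched as a single tabular step. This is a minimal illustration, not the authors' implementation: the function name `q_learning_update`, the dictionary-based table, and the values of the learning rate `alpha` and discount factor `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step (hypothetical helper for illustration).

    Moves Q(state, action) towards reward + gamma * max_a' Q(next_state, a').
    """
    # Bootstrapped value of the next state. Taking the max over noisy
    # estimates is what introduces maximization bias.
    next_value = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * next_value
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Toy usage: two actions, hand-picked Q-values for the next state.
q = defaultdict(float)
actions = [0, 1]
q[("s1", 0)] = 1.0
q[("s1", 1)] = 2.0
new_value = q_learning_update(q, "s0", 0, reward=1.0, next_state="s1",
                              actions=actions, alpha=0.5, gamma=0.9)
```

In the toy call the target is 1.0 + 0.9 * 2.0 = 2.8, so the previously zero entry moves halfway to it.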


Gaussian Approximation for Bias Reduction in Q-Learning
The Weighted Estimator is introduced as an effective solution to mitigate the negative effects of overestimation in Q-Learning, and bounds on the bias and the variance of the weighting are provided, showing its advantages over other estimators in the literature.
Exploiting Action-Value Uncertainty to Drive Exploration in Reinforcement Learning
This paper proposes several algorithms to use Thompson Sampling in RL and deep RL in a feasible way, explaining the intuitions and theoretical considerations behind them and discussing their advantages and drawbacks, and provides an empirical evaluation on an increasingly complex set of RL problems, showing the benefit of TS w.r.t. the dimensionality of the problem.
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
This paper proposes a generalization of Q-learning, called Maxmin Q-learning, which provides a parameter to flexibly control bias, and empirically verifies that the algorithm better controls estimation bias in toy environments and that it achieves superior performance on several benchmark problems.
Self-correcting Q-Learning
A new way to address the maximization bias in the form of a "self-correcting algorithm" for approximating the maximum of an expected value is introduced and it is shown theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
Double Q-learning
An alternative way to approximate the maximum expected value for any set of random variables is introduced, and the resulting double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value.
Deep Reinforcement Learning with Double Q-Learning
This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
Issues in Using Function Approximation for Reinforcement Learning
This paper gives a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail, and presents experimental results which support the theoretical findings.
Action Candidate Based Clipped Double Q-learning for Discrete and Continuous Action Tasks
Theoretically, the underestimation bias in the clipped Double Q-learning decays monotonically as the number of action candidates decreases, and the algorithm can more accurately estimate the maximum expected action value on some toy environments and yield good performance on several benchmark problems.
Bias-corrected Q-learning to control max-operator bias in Q-learning
This work presents a bias-corrected Q-learning algorithm with asymptotically unbiased resistance against the max-operator bias, and shows that the algorithm asymptotically converges to the optimal policy, as Q-learning does.
Addressing Function Approximation Error in Actor-Critic Methods
This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
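The maximization bias that several of the entries above address can be illustrated with a small simulation. This is only a sketch under stated assumptions: every action has a true mean of zero (so the true maximum expected value is zero), rewards are standard normal, and the function names `single_estimator`, `double_estimator`, and `simulate` are hypothetical. It contrasts the plain maximum of sample means with a sample-splitting double estimator in the spirit of the Double Q-learning entry.

```python
import random
from statistics import mean


def single_estimator(samples):
    """Maximum of the per-action sample means (biased upwards)."""
    return max(mean(s) for s in samples)


def double_estimator(samples):
    """Select the best action on one half of the data, evaluate it on the other."""
    half_a = [s[: len(s) // 2] for s in samples]
    half_b = [s[len(s) // 2:] for s in samples]
    best = max(range(len(samples)), key=lambda i: mean(half_a[i]))
    return mean(half_b[best])


def simulate(num_actions=10, num_samples=20, trials=2000, seed=0):
    """Average both estimators over many trials; the true value is 0 by assumption."""
    rng = random.Random(seed)
    single, double = [], []
    for _ in range(trials):
        samples = [[rng.gauss(0.0, 1.0) for _ in range(num_samples)]
                   for _ in range(num_actions)]
        single.append(single_estimator(samples))
        double.append(double_estimator(samples))
    return mean(single), mean(double)


single_bias, double_bias = simulate()
```

In this equal-means setting the single estimator averages well above zero, while the double estimator stays close to zero. When the true means differ, the double estimator can underestimate instead, as the Double Q-learning summary notes.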