Corpus ID: 248887826

Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

@inproceedings{Saglam2021ParameterfreeRO,
  title={Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients},
  author={Baturay Saglam and Furkan Burak Mutlu and Dogan Can Cicek and Suleyman Serdar Kozat},
  year={2021}
}
Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical…
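The core claim above, that taking the minimum over independently noisy critic estimates produces an underestimation bias whose magnitude grows with the variance of the reinforcement signal, can be illustrated numerically. The sketch below is not from the paper; the Gaussian error model, the noise levels, and the true action value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = 10.0        # assumed true action value
n_samples = 100_000  # Monte Carlo repetitions

for noise_std in (0.1, 1.0, 5.0):  # assumed noise levels of the reinforcement signal
    # Two critics whose estimation errors are independent zero-mean Gaussians.
    q1 = true_q + rng.normal(0.0, noise_std, n_samples)
    q2 = true_q + rng.normal(0.0, noise_std, n_samples)
    # Clipped Double Q-learning style estimate: take the element-wise minimum.
    clipped = np.minimum(q1, q2)
    print(f"noise_std={noise_std}: bias of min-critic estimate = {clipped.mean() - true_q:+.3f}")
```

Each critic alone is unbiased, yet the minimum is biased low, and the bias grows with the noise level; this is the high-variance underestimation regime the abstract refers to.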


References

Showing 1-10 of 35 references
Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods
TLDR
This work shows that in deep actor-critic methods designed to overcome the overestimation bias, a significant underestimation bias arises when the reinforcement signals received by the agent have a high variance, and it introduces a novel, parameter-free deep Q-learning variant.
Reducing Estimation Bias via Triplet-Average Deep Deterministic Policy Gradient
TLDR
This article investigates the underestimation phenomenon in the recent twin delayed deep deterministic actor-critic algorithm and theoretically demonstrates its existence, and proposes a novel triplet-average deep deterministic policy gradient algorithm that takes the weighted action value of three target critics to reduce the estimation bias.
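A minimal sketch of the triplet-average idea this summary names, assuming a fixed convex weighting of three target critics; the article derives its own weighting, so the weights and the function signature here are purely illustrative.

```python
import numpy as np

def triplet_average_target(reward, q_targets, gamma=0.99, weights=(0.4, 0.3, 0.3)):
    """TD target built from three target critics.

    q_targets: array-like of shape (3,) holding Q'_1, Q'_2, Q'_3 at (s', a').
    weights:   assumed convex weights; only the general shape of the idea.
    """
    weighted_q = float(np.dot(weights, q_targets))
    return reward + gamma * weighted_q

# Example: three target critics that disagree about the next-state value.
print(triplet_average_target(reward=1.0, q_targets=np.array([10.2, 9.5, 9.9])))
```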
Addressing Function Approximation Error in Actor-Critic Methods
TLDR
This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
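The mechanism named here is the clipped double-Q target from the TD3 paper. A minimal sketch of that target computation, with the batch layout and discount value assumed for illustration:

```python
import torch

def clipped_double_q_target(reward, done, next_q1, next_q2, gamma=0.99):
    """TD3-style target: bootstrap from the smaller of two target-critic
    estimates of Q(s', a') to limit overestimation."""
    next_q = torch.min(next_q1, next_q2)
    return reward + gamma * (1.0 - done) * next_q

# Example with a batch of two transitions (illustrative values).
r = torch.tensor([1.0, 0.5])
d = torch.tensor([0.0, 1.0])
print(clipped_double_q_target(r, d, torch.tensor([10.2, 3.0]), torch.tensor([9.8, 3.4])))
```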
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
TLDR
This paper proposes a generalization of Q-learning, called Maxmin Q-learning, which provides a parameter to flexibly control bias, and empirically verifies that the algorithm better controls estimation bias in toy environments and achieves superior performance on several benchmark problems.
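A tabular sketch of the mechanism this summary names: bootstrap from the minimum over N independent Q-estimates, where the ensemble size N is the parameter that controls the bias. The tabular setting and the sizes below are assumptions for illustration.

```python
import numpy as np

def maxmin_target(q_ensemble, next_state, reward, gamma=0.99):
    """Tabular sketch: q_ensemble has shape (N, num_states, num_actions).
    The target bootstraps from the action-wise minimum over the N estimates;
    increasing N pushes the estimate lower, which is the bias-control knob."""
    q_min = q_ensemble[:, next_state, :].min(axis=0)  # min over the N estimates
    return reward + gamma * q_min.max()               # greedy action under the min-ensemble

# Example with N = 4 estimates, 5 states, 3 actions (assumed sizes).
rng = np.random.default_rng(1)
q_ensemble = rng.normal(size=(4, 5, 3))
print(maxmin_target(q_ensemble, next_state=2, reward=0.5))
```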
WD3: Taming the Estimation Bias in Deep Reinforcement Learning
  • Qiang He, Xinwen Hou, 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), 2020
TLDR
This work proposes a novel Weighted Delayed Deep Deterministic Policy Gradient algorithm, which can reduce the estimation error and further improve the performance by weighting a pair of critics, and evaluates it on OpenAI Gym continuous control tasks.
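One plausible reading of "weighting a pair of critics" is to interpolate between the pessimistic minimum and the plain average of the two target critics with a weight beta; the interpolation form and the value of beta below are assumptions rather than the exact WD3 formulation.

```python
import torch

def weighted_pair_target(reward, done, next_q1, next_q2, gamma=0.99, beta=0.75):
    """Interpolate between the pessimistic minimum and the plain average of
    two target critics; beta trades underestimation against overestimation."""
    pessimistic = torch.min(next_q1, next_q2)
    average = 0.5 * (next_q1 + next_q2)
    next_q = beta * pessimistic + (1.0 - beta) * average
    return reward + gamma * (1.0 - done) * next_q
```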
Off-Policy Temporal Difference Learning with Function Approximation
TLDR
The first algorithm for off-policy temporal-difference learning that is stable with linear function approximation is introduced, and it is proved that, given training under any ε-soft policy, the algorithm converges with probability one to a close approximation to the action-value function for an arbitrary target policy.
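A heavily simplified sketch of the ingredient this entry refers to: an off-policy TD(0) update with linear function approximation whose TD error is reweighted by the per-step importance ratio. This illustrates the general mechanism only, not the specific algorithm or convergence construction from the paper.

```python
import numpy as np

def off_policy_td0_step(w, phi, phi_next, reward, pi_prob, b_prob,
                        gamma=0.99, alpha=0.05):
    """One importance-weighted TD(0) step with linear features.
    pi_prob / b_prob: target- and behavior-policy probabilities of the taken action.
    alpha is an assumed step size."""
    rho = pi_prob / b_prob                             # importance-sampling ratio
    td_error = reward + gamma * phi_next @ w - phi @ w
    return w + alpha * rho * td_error * phi

w = np.zeros(4)
w = off_policy_td0_step(w, np.array([1.0, 0.0, 0.0, 1.0]),
                        np.array([0.0, 1.0, 0.0, 1.0]),
                        reward=1.0, pi_prob=0.9, b_prob=0.5)
print(w)
```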
Deep Reinforcement Learning with Double Q-Learning
TLDR
This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
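The "specific adaptation" is the Double DQN target, which decouples action selection (online network) from action evaluation (target network). A minimal sketch, with tensor shapes and discount assumed:

```python
import torch

def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """next_q_online / next_q_target: tensors of shape (batch, num_actions).
    The online network selects the greedy action; the target network evaluates it."""
    greedy_actions = next_q_online.argmax(dim=1, keepdim=True)
    evaluated = next_q_target.gather(1, greedy_actions).squeeze(1)
    return reward + gamma * (1.0 - done) * evaluated
```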
Importance sampling in reinforcement learning with an estimated behavior policy
TLDR
This article studies importance sampling where the behavior policy action probabilities are replaced by a maximum likelihood estimate of those probabilities under the observed data, and shows this general technique reduces variance due to sampling error in Monte Carlo style estimators.
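A small sketch of the idea in a bandit-style Monte Carlo estimator: the importance weights use empirical (maximum-likelihood) action frequencies in place of the true behavior probabilities. The discrete-action setting, the estimator form, and the helper name are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def is_value_estimate(actions, rewards, pi_probs, behavior_probs=None):
    """Ordinary importance-sampling estimate of a target policy's value.
    When behavior_probs is None, the behavior probabilities are replaced by
    their maximum-likelihood (empirical frequency) estimates."""
    actions = list(actions)
    rewards = np.asarray(rewards, dtype=float)
    if behavior_probs is None:
        counts = Counter(actions)
        mu = np.array([counts[a] / len(actions) for a in actions])
    else:
        mu = np.array([behavior_probs[a] for a in actions])
    weights = np.array([pi_probs[a] for a in actions]) / mu
    return float(np.mean(weights * rewards))

# Toy bandit data; the target policy always plays action 1 (assumed values).
acts, rews = [0, 1, 1, 0, 1], [0.0, 1.0, 1.0, 0.0, 1.0]
print(is_value_estimate(acts, rews, pi_probs={0: 0.0, 1: 1.0}))
```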
Off-Policy Actor-Critic with Shared Experience Replay
TLDR
This work analyzes the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods, argues for mixing experience sampled from replay with on-policy experience, and proposes a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable.
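A compact sketch of the V-trace correction whose bias-variance trade-off is analyzed here: per-step importance ratios are clipped by two thresholds, commonly written ρ̄ and c̄, and those thresholds are where the trade-off enters. The recursion below follows the published V-trace definition in simplified single-trajectory form; the default thresholds are assumed.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Simplified V-trace value targets for a single trajectory.
    rhos: per-step importance ratios pi(a_t|s_t) / mu(a_t|s_t).
    Clipping at rho_bar shifts the fixed point (bias); clipping at c_bar
    limits how far corrections propagate (variance)."""
    T = len(rewards)
    clipped_rho = np.minimum(rhos, rho_bar)
    clipped_c = np.minimum(rhos, c_bar)
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rho * (rewards + gamma * next_values - values)
    acc, vs_minus_v = 0.0, np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * clipped_c[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# Tiny example trajectory (illustrative numbers).
print(vtrace_targets(rewards=np.array([1.0, 0.0, 1.0]),
                     values=np.array([0.5, 0.4, 0.6]),
                     bootstrap_value=0.3,
                     rhos=np.array([1.2, 0.8, 1.5])))
```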
Deep Reinforcement Learning that Matters
TLDR
Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.