# Softmax Deep Double Deterministic Policy Gradients

```bibtex
@article{Pan2020SoftmaxDD,
  title   = {Softmax Deep Double Deterministic Policy Gradients},
  author  = {Ling Pan and Qingpeng Cai and Longbo Huang},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2010.09177}
}
```

A widely used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect its performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous…
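The Boltzmann softmax operator mentioned in the abstract is an exponentially weighted average of values, interpolating between the mean (as the temperature parameter approaches 0) and the max (as it grows). A minimal sketch over a discrete set of Q-values (the function name and the stability trick are illustrative, not taken from the paper's implementation):

```python
import math

def boltzmann_softmax(q_values, beta):
    """Boltzmann softmax operator: a weighted average of Q-values with
    weights proportional to exp(beta * Q). As beta -> infinity this
    approaches max(q_values); as beta -> 0 it approaches the mean."""
    m = max(q_values)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total

# For any finite beta > 0 the result lies strictly between the mean and
# the max, which is the smoothing property the paper exploits.
print(boltzmann_softmax([1.0, 2.0, 3.0], 1.0))
```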


#### 8 Citations

Variance aware reward smoothing for deep reinforcement learning

- Computer Science
- Neurocomputing
- 2021

This paper investigates a common phenomenon, reward drop, in late-stage RL training, where the reward trajectory oscillates dramatically, and proposes a novel reward-shaping technique named Variance Aware Rewards Smoothing (VAR).

Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning

- Computer Science
- ArXiv
- 2021

This work empirically demonstrates that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a particularly severe overestimation problem which is not mitigated by existing approaches, and designs a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline, demonstrating its effectiveness in stabilizing learning.

Efficient Continuous Control with Double Actors and Regularized Critics

- Computer Science
- ArXiv
- 2021

The bias-alleviation property of double actors is uncovered and demonstrated by building double actors upon a single critic and double critics to handle overestimation bias in DDPG and underestimation bias in TD3, respectively; interestingly, double actors are also found to improve the exploration ability of the agent.

Bagged Critic for Continuous Control

- 2021

Actor-critic methods have been successfully applied to several high dimensional continuous control tasks. Despite their success, they are prone to overestimation bias that leads to sub-optimal…

Automating Control of Overestimation Bias for Continuous Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2021

A simple data-driven approach for guiding bias correction that can adjust the bias correction across environments automatically and eliminates the need for an extensive hyperparameter search, significantly reducing the actual number of interactions and computation.

Replicating Softmax Deep Double Deterministic Policy Gradients

- 2021

We compare the performance of TD3 and SD3 on a variety of continuous control tasks. We use the authors' PyTorch code but also provide TensorFlow implementations of SD3 and TD3 (which we did not use…

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

- Computer Science
- ArXiv
- 2021

This paper proposes a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by only trusting the state-action pairs given in the dataset for value estimation and extends ICQ to multi-agent tasks by decomposing the joint-policy under the implicit constraint.

Regularized Softmax Deep Multi-Agent $Q$-Learning

- Computer Science
- 2021

This work empirically demonstrates that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a more severe overestimation in practice than previously acknowledged, which is not mitigated by existing approaches, and proposes a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline, demonstrating its effectiveness in stabilizing learning.

#### References

Showing 1–10 of 41 references.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

- Computer Science, Mathematics
- ICML
- 2018

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Reinforcement Learning with Dynamic Boltzmann Softmax Updates

- Computer Science, Mathematics
- ArXiv
- 2019

The DBS-DQN algorithm is proposed by applying dynamic Boltzmann softmax updates in deep Q-network, which outperforms DQN substantially in 40 out of 49 Atari games.

Expected Policy Gradients

- Computer Science, Mathematics
- AAAI
- 2018

A new general policy gradient theorem is established, of which the stochastic and deterministic policy gradient theorems are special cases, and it is proved that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead.

Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

- Computer Science
- ICML
- 2019

It is shown that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation.

Better Exploration with Optimistic Actor-Critic

- Computer Science, Mathematics
- NeurIPS
- 2019

A new algorithm, Optimistic Actor Critic, is introduced, which approximates a lower and upper confidence bound on the state-action value function and applies the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation.

Evolution-Guided Policy Gradient in Reinforcement Learning

- Computer Science, Mathematics
- NeurIPS
- 2018

Evolutionary Reinforcement Learning (ERL) is a hybrid algorithm that leverages the population of an EA to provide diversified data to train an RL agent, and periodically reinserts the RL agent into the EA population to inject gradient information into the EA.

Continuous Deep Q-Learning with Model-based Acceleration

- Computer Science
- ICML
- 2016

This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.

Towards Characterizing Divergence in Deep Q-Learning

- Computer Science, Mathematics
- ArXiv
- 2019

An algorithm is developed which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions).

Smoothed Action Value Functions for Learning Gaussian Policies

- Computer Science, Mathematics
- ICML
- 2018

This work proposes a new notion of action value defined by a Gaussian-smoothed version of the expected Q-value, and shows that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment.

Double Q-learning

- Computer Science, Mathematics
- NIPS
- 2010

An alternative way to approximate the maximum expected value for any set of random variables is introduced, and the obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value.
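The double estimator idea behind Double Q-learning can be sketched as follows: one set of samples selects the argmax action, while an independent set supplies the value estimate for that action, which avoids the positive bias of taking the max over a single set of noisy means. A minimal illustrative sketch (the function and argument names are illustrative, not from the paper):

```python
def double_estimate(samples_a, samples_b):
    """Double estimator of max_a E[X_a]: estimator A (from samples_a)
    picks the argmax action, and estimator B (from samples_b) supplies
    the value for that action. Each list holds per-action sample lists."""
    means_a = [sum(s) / len(s) for s in samples_a]
    means_b = [sum(s) / len(s) for s in samples_b]
    best = max(range(len(means_a)), key=lambda i: means_a[i])
    return means_b[best]

# A single estimator would return max(means_a) and tends to overestimate
# under noise; the double estimator decouples selection from evaluation.
print(double_estimate([[1.0, 1.0], [2.0, 2.0]], [[1.5, 1.5], [0.5, 0.5]]))
```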