Corpus ID: 224705976

Softmax Deep Double Deterministic Policy Gradients

Ling Pan, Qingpeng Cai, Longbo Huang
Deep Deterministic Policy Gradients (DDPG), a widely used actor-critic reinforcement learning algorithm for continuous control, suffers from the overestimation problem, which can negatively affect performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous…
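The Boltzmann softmax operator mentioned in the abstract is a standard construction: an exp-weighted average of action values that interpolates between the mean and the max. A minimal sketch (variable names are my own, not the authors'):

```python
import numpy as np

def boltzmann_softmax(q_values, beta=1.0):
    # Boltzmann softmax operator: an exp(beta * Q)-weighted average of
    # the Q-values. beta -> inf recovers max(Q); beta -> 0 recovers mean(Q),
    # so intermediate beta trades off between over- and underestimation.
    q = np.asarray(q_values, dtype=np.float64)
    w = np.exp(beta * (q - q.max()))  # subtract max for numerical stability
    return float(np.sum(w * q) / np.sum(w))
```

For example, `boltzmann_softmax([1.0, 2.0], beta=0.0)` returns the mean 1.5, while a large `beta` pushes the result toward the maximum 2.0.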
Variance aware reward smoothing for deep reinforcement learning
  • Yunlong Dong, Shengjun Zhang, Xing Liu, Yu Zhang, Tan Shen
  • Computer Science
  • Neurocomputing
  • 2021
This paper investigates rewards drop, a common phenomenon in late-stage RL training where the rewards trajectory oscillates dramatically, and proposes a novel reward-shaping technique named Variance Aware Rewards Smoothing (VAR).
Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning
This work empirically demonstrates that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a particularly severe overestimation problem which is not mitigated by existing approaches, and designs a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline, demonstrating its effectiveness in stabilizing learning.
Efficient Continuous Control with Double Actors and Regularized Critics
The bias-alleviation property of double actors is uncovered and demonstrated by building double actors upon a single critic and double critics to handle the overestimation bias in DDPG and the underestimation bias in TD3, respectively; it is also found, interestingly, that double actors help improve the exploration ability of the agent.
Bagged Critic for Continuous Control
Actor-critic methods have been successfully applied to several high-dimensional continuous control tasks. Despite their success, they are prone to an overestimation bias that leads to sub-optimal…
Automating Control of Overestimation Bias for Continuous Reinforcement Learning
A simple data-driven approach for guiding bias correction that adjusts the correction across environments automatically and eliminates the need for an extensive hyperparameter search, significantly reducing the actual number of interactions and computation.
Replicating Softmax Deep Double Deterministic Policy Gradients
  • 2021
We compare the performance of TD3 and SD3 on a variety of continuous control tasks. We use the authors' PyTorch code but also provide Tensorflow implementations of SD3 and TD3 (which we did not use…)
Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning
This paper proposes a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by only trusting the state-action pairs given in the dataset for value estimation and extends ICQ to multi-agent tasks by decomposing the joint-policy under the implicit constraint.
Regularized Softmax Deep Multi-Agent $Q$-Learning
This work empirically demonstrates that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a more severe overestimation in practice than previously acknowledged, which is not mitigated by existing approaches, and proposes a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline, demonstrating its effectiveness in stabilizing learning.


Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
Reinforcement Learning with Dynamic Boltzmann Softmax Updates
The DBS-DQN algorithm is proposed by applying dynamic Boltzmann softmax updates in deep Q-network, which outperforms DQN substantially in 40 out of 49 Atari games.
Expected Policy Gradients
A new general policy gradient theorem is established, of which the stochastic and deterministic policy gradient theorems are special cases, and it is proved that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead.
Revisiting the Softmax Bellman Operator: New Benefits and New Perspective
It is shown that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation.
Better Exploration with Optimistic Actor-Critic
A new algorithm, Optimistic Actor Critic, is introduced, which approximates a lower and upper confidence bound on the state-action value function and allows the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation.
Evolution-Guided Policy Gradient in Reinforcement Learning
This paper introduces Evolutionary Reinforcement Learning (ERL), a hybrid algorithm that leverages the population of an evolutionary algorithm (EA) to provide diversified data for training an RL agent, and periodically reinserts the RL agent into the EA population to inject gradient information into the EA.
Continuous Deep Q-Learning with Model-based Acceleration
This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.
Towards Characterizing Divergence in Deep Q-Learning
An algorithm is developed which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions).
Smoothed Action Value Functions for Learning Gaussian Policies
This work proposes a new notion of action value defined by a Gaussian-smoothed version of the expected Q-value, and shows that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment.
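The Gaussian-smoothed action value described above can be illustrated with a simple Monte Carlo estimate; the function and parameter names below are illustrative, not from the paper:

```python
import numpy as np

def smoothed_q(q_fn, a, sigma, n_samples=100_000, seed=0):
    # Monte Carlo estimate of the Gaussian-smoothed action value
    # Q_tilde(a) = E_{eps ~ N(0, sigma^2)}[Q(a + eps)].
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n_samples)
    return float(np.mean(q_fn(a + eps)))

# Sanity check: for Q(a) = -a^2 the smoothed value at a is -a^2 - sigma^2
# exactly, since E[(a + eps)^2] = a^2 + sigma^2.
```

For a quadratic Q this smoothing simply subtracts the noise variance, which makes the estimator easy to validate.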
Double Q-learning
An alternative way to approximate the maximum expected value for any set of random variables is introduced, and the obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value.
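The bias contrast behind the double estimator can be seen in a small simulation (this is an illustrative experiment of my own, not one from the paper): the single estimator takes the max over noisy sample means and is biased upward, while the double estimator selects the argmax with one half of the data and evaluates it with the other half.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples, n_actions = 2000, 10, 5

# All actions have true expected value 0, observed through N(0, 1) noise,
# so any positive estimate of max_a E[X_a] is pure overestimation bias.
single, double = [], []
for _ in range(n_trials):
    s = rng.normal(0.0, 1.0, size=(2 * n_samples, n_actions))
    a, b = s[:n_samples], s[n_samples:]
    # Single estimator: max over sample means -> overestimates max_a E[X_a].
    single.append(a.mean(axis=0).max())
    # Double estimator: pick the argmax with A, evaluate it with B.
    double.append(b.mean(axis=0)[a.mean(axis=0).argmax()])
```

Averaged over trials, the single estimator's mean lands well above the true value of 0, while the double estimator's mean stays near (or slightly below) 0, matching the paper's claim that it may underestimate rather than overestimate.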