Self-Adaptive Double Bootstrapped DDPG

Zhuobin Zheng, Chun Yuan, Zhihui Lin, Yangyang Cheng, Hanghao Wu
The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. […] To alleviate the instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of the bootstrapped heads, enhancing ensemble performance effectively and efficiently. We demonstrate that SOUP learns at least 45% faster while substantially improving cumulative reward and stability in comparison to…
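The self-adaptive confidence mechanism summarized above can be sketched as a softmax weighting over bootstrapped Q-heads. This is a minimal sketch, not the paper's exact formulation: the function names, the use of recent per-head TD error as an inverse confidence signal, and the temperature parameter are all illustrative assumptions.

```python
import numpy as np

def confidence_weights(td_errors, temperature=1.0):
    """Map per-head TD errors to ensemble weights: heads with lower
    recent TD error (higher confidence) receive larger weight via a softmax."""
    scores = -np.asarray(td_errors, dtype=float) / temperature
    scores -= scores.max()              # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def ensemble_q(head_q_values, td_errors, temperature=1.0):
    """Confidence-weighted combination of bootstrapped Q-head estimates."""
    w = confidence_weights(td_errors, temperature)
    return float(np.dot(w, head_q_values))
```

Lowering the temperature sharpens the weighting toward the most confident head; raising it approaches a plain ensemble average.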

Continuous Control With Ensemble Deep Deterministic Policy Gradients
It is shown how existing tools can be brought together in a novel way, giving rise to the Ensemble Deep Deterministic Policy Gradients (ED2) method, to yield state-of-the-art results on continuous control tasks from OpenAI Gym MuJoCo.
Conservative Policy Gradient in Multi-critic Setting
An algorithm based on TD3, conservative policy gradient (CPG), is proposed that optimizes the policy with respect to the lower bound of the two Q-functions to deal with the inconsistency.
Deep Ensemble Reinforcement Learning with Multiple Deep Deterministic Policy Gradient Algorithm
This paper proposes a framework of m-out-of-n bootstrapped and aggregated multiple deep deterministic policy gradient to accelerate the training process and increase the performance, and demonstrates that the proposed method outperforms the existing algorithms in both efficiency and performance.
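The m-out-of-n bootstrapping behind this framework can be illustrated with per-transition head masks on the replay batch. This is a sketch under assumptions: the helper names and the Bernoulli masking (a common approximation to m-out-of-n resampling) are not the authors' exact procedure.

```python
import numpy as np

def bootstrap_masks(n_heads, batch_size, m_over_n=0.5, rng=None):
    """Binary masks deciding which of the n heads train on each transition,
    so each transition is seen by roughly m-out-of-n heads on average."""
    rng = np.random.default_rng(rng)
    return rng.random((batch_size, n_heads)) < m_over_n

def masked_td_loss(td_errors, masks):
    """Per-head mean squared TD error, counting only masked-in transitions."""
    sq = np.square(td_errors) * masks           # zero out excluded samples
    counts = np.maximum(masks.sum(axis=0), 1)   # avoid division by zero
    return sq.sum(axis=0) / counts
```

With all-ones masks this reduces to ordinary per-head mean squared TD error; smaller `m_over_n` increases diversity among heads at the cost of data per head.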
A Novel Heterogeneous Actor-critic Algorithm with Recent Emphasizing Replay Memory
An off-policy heterogeneous actor-critic (HAC) algorithm that combines a soft Q-function with an ordinary Q-function is proposed; it outperforms prior state-of-the-art methods in training efficiency and performance, validating the effectiveness of the method.
Self-Supervised Mixture-of-Experts by Uncertainty Estimation
This paper proposes Self-Supervised Mixture-of-Experts (SUM), an effective algorithm driven by predictive uncertainty estimation for multitask RL, which learns faster and achieves better performance through efficient transfer and robust generalization, outperforming several related methods on extended OpenAI Gym MuJoCo multi-task environments.
Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning
A new policy iteration theory is proposed as an important extension of soft policy iteration and Soft Actor-Critic (SAC), one of the most efficient model free algorithms for deep reinforcement learning, and arbitrary entropy measures that generalize Shannon entropy can be utilized to properly randomize action selection.
Exploration in Deep Reinforcement Learning: A Comprehensive Survey
A comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks is presented; the survey also summarizes the open problems of exploration in DRL and deep MARL and points out a few future directions.
Balancing Value Underestimation and Overestimation with Realistic Actor-Critic
This work proposes a novel model-free algorithm, Realistic Actor-Critic (RAC), which addresses the trade-off between value underestimation and overestimation by learning a policy family spanning various confidence bounds of the Q-function, and constructs uncertainty-punished Q-learning (UPQ), which uses uncertainty from an ensemble of multiple critics to control the estimation bias of the Q-function.
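The uncertainty-punished target described for UPQ can be sketched as a lower confidence bound over ensemble critics: the mean estimate minus a penalty proportional to critic disagreement. The function name and the single scalar `beta` are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def punished_q_target(ensemble_q, beta=0.5):
    """Lower-confidence-bound target over an ensemble of critic estimates:
    disagreement (standard deviation) among critics lowers the target,
    curbing value overestimation."""
    q = np.asarray(ensemble_q, dtype=float)
    return float(q.mean() - beta * q.std())
```

Setting `beta = 0` recovers a plain ensemble average; larger `beta` trades optimism for conservatism, which is the underestimation/overestimation balance the entry refers to.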
Optimizing hyperparameters of deep reinforcement learning for autonomous driving based on whale optimization algorithm
A swarm-based optimization algorithm, namely the Whale Optimization Algorithm (WOA), is employed for optimizing the hyperparameters of the Deep Deterministic Policy Gradient (DDPG) algorithm to achieve the optimum control strategy in an autonomous driving control problem.
Sample-Efficient Learning-Based Controller For Bipedal Walking In Robotic Systems (2021)
This work follows the imitation learning approach of DeepMimic and uses the Proximal Policy Optimization algorithm to achieve stable and visually human-like forward walking in 3D, and develops a new metric for measuring the sample efficiency of an algorithm in the considered context.


Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major
Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning
The main incentive of this work is to keep the advantages of model-free Q-learning while minimizing real-world interaction by the employment of a dynamics model learned in parallel, to counteract adverse effects of imaginary rollouts with an inaccurate model.
Deep Exploration via Bootstrapped DQN
Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and
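Bootstrapped DQN's exploration scheme, sampling one head at the start of an episode and acting greedily with it throughout, can be sketched as follows. The tabular `q_table` layout and helper name are illustrative assumptions; the original method uses neural network heads over a shared torso.

```python
import numpy as np

def run_episode_head(n_heads, q_table, state_seq, rng=None):
    """Sample one bootstrapped head per episode and act greedily with it.
    `q_table` is a hypothetical (n_heads, n_states, n_actions) array;
    committing to one head yields temporally extended ("deep") exploration."""
    rng = np.random.default_rng(rng)
    k = rng.integers(n_heads)          # head sampled once, kept all episode
    return [int(np.argmax(q_table[k, s])) for s in state_seq]
```

Because the sampled head is held fixed for the whole episode, the agent follows one self-consistent hypothesis about the value function rather than dithering action-by-action as epsilon-greedy does.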
Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
The significance of hyper-parameters in policy gradients for continuous control, general variance in the algorithms, and reproducibility of reported results are investigated, and guidelines on reporting novel results as comparisons against baseline methods are provided.
Deep Reinforcement Learning that Matters
Challenges posed by reproducibility, proper experimental techniques, and reporting procedures are investigated and guidelines to make future results in deep RL more reproducible are suggested.
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Reinforcement Learning with Unsupervised Auxiliary Tasks
This paper significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
Parameter Space Noise for Exploration
This work demonstrates that RL with parameter noise learns more efficiently than traditional RL with action space noise and evolutionary strategies individually through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks.
Curiosity-Driven Exploration by Self-Supervised Prediction
This work forms curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model, which scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and ignores the aspects of the environment that cannot affect the agent.
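The curiosity signal described here, prediction error in a learned feature space rather than pixel space, can be sketched as a scaled squared error between predicted and observed next-state features. The function name, the scaling constant `eta`, and the raw feature vectors are assumptions; in the original method the features come from a self-supervised inverse dynamics model.

```python
import numpy as np

def intrinsic_reward(pred_next_feat, true_next_feat, eta=0.1):
    """Curiosity bonus: scaled squared error of the forward model's
    prediction of the next state's feature vector. Large error means
    the transition was surprising, so exploration is rewarded."""
    diff = np.asarray(pred_next_feat, float) - np.asarray(true_next_feat, float)
    return float(eta * 0.5 * np.dot(diff, diff))
```

Working in a learned feature space (instead of predicting pixels) is what lets the bonus ignore environment aspects that cannot affect the agent, as the summary notes.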
End-to-End Training of Deep Visuomotor Policies
This paper develops a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors, trained using a partially observed guided policy search method, with supervision provided by a simple trajectory-centric reinforcement learning method.