Proximal Deterministic Policy Gradient

@inproceedings{maggipinto2020proximal,
  title={Proximal Deterministic Policy Gradient},
  author={Marco Maggipinto and Gian Antonio Susto and Pratik Chaudhari},
  booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2020}
}
This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration: the target network plays the role of the optimization variable, and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action-value estimate through bootstrapping, with limited increase of…
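A minimal sketch of the inner proximal step described above, using a toy quadratic TD loss (function and variable names are hypothetical, not from the paper): the value parameters are pulled toward the TD target while a quadratic proximal term keeps them in the vicinity of the target-network parameters.

```python
import numpy as np

def proximal_value_step(theta, theta_target, td_grad, lam=1.0, lr=0.1):
    """One gradient step on the proximal objective
    L(theta) = TD_loss(theta) + (lam / 2) * ||theta - theta_target||^2,
    whose exact minimizer is the proximal operator of the TD loss
    evaluated at theta_target."""
    grad = td_grad(theta) + lam * (theta - theta_target)
    return theta - lr * grad

# Toy quadratic TD loss: 0.5 * ||theta - y||^2 with fixed "TD target" y.
y = np.array([1.0, -2.0])
td_grad = lambda th: th - y

theta = np.zeros(2)
theta_target = np.zeros(2)  # held fixed here; updated on a slower timescale in practice
for _ in range(2000):
    theta = proximal_value_step(theta, theta_target, td_grad, lam=1.0)
# With lam=1 the inner problem's minimizer is (y + theta_target) / 2,
# i.e. halfway between the TD target and the target-network parameters.
```

The proximal weight `lam` trades off fitting the TD target against staying close to the target network; `lam -> 0` recovers the ordinary TD update.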


Deep Q-Network with Proximal Iteration

A concrete application of Proximal Iteration in deep reinforcement learning that endows the objective function of the Deep Q-Network agent with a proximal term to ensure that the online-network component of DQN remains in the vicinity of the target network.

Proximal Iteration for Deep Reinforcement Learning

The objective function of the Deep Q-Network and Rainbow agents is endowed with a proximal term to ensure robustness in the presence of large noise, and the resultant agents exhibit significant improvements over their original counterparts on the Atari benchmark.

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective
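The "surrogate" objective mentioned above is, in PPO's clipped variant, the minimum of the unclipped importance-weighted advantage and a version whose probability ratio is clipped to [1-eps, 1+eps]. A hedged sketch (names are illustrative, not from the paper):

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized):
    mean over samples of min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = exp(logp_new - logp_old) is the probability ratio."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

When the ratio moves outside the clip range in the direction favored by the advantage, the clipped term caps the objective, removing the incentive to push the policy further from the data-collecting policy.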

P3O: Policy-on Policy-off Policy Optimization

A simple algorithm named P3O is developed that interleaves off-policy updates with on-policy updates and uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other, and does not introduce any additional hyper-parameters.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Trust Region Policy Optimization

A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).

Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?

A fine-grained analysis of state-of-the-art methods based on key aspects of this framework: gradient estimation, value prediction, optimization landscapes, and trust region enforcement is proposed.

Continuous control with deep reinforcement learning

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
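The deterministic policy gradient underlying such actor-critic methods follows the chain rule through the critic: dJ/dtheta = E[ dQ/da * dmu/dtheta ] evaluated at a = mu(s). A 1-D toy sketch (the quadratic Q and linear policy are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dpg_update(theta, states, lr=0.05):
    """One deterministic policy gradient ascent step for a toy problem:
    Q(s, a) = -(a - s)^2  (optimal action a* = s), policy mu_theta(s) = theta * s.
    Chain rule: dJ/dtheta = mean( dQ/da |_{a=mu(s)} * dmu/dtheta )."""
    a = theta * states
    dq_da = -2.0 * (a - states)   # gradient of Q w.r.t. the action
    dmu_dtheta = states           # gradient of the policy w.r.t. theta
    return theta + lr * np.mean(dq_da * dmu_dtheta)

theta = 0.0
states = rng.normal(size=256)
for _ in range(500):
    theta = dpg_update(theta, states)
# Gradient ascent drives theta toward 1, i.e. mu(s) = s, the Q-optimal action.
```

No sampling over actions is needed: because the policy is deterministic, the gradient flows directly through the critic's action input, which is what makes the approach sample-efficient for continuous action spaces.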

Smoothed Dual Embedding Control

A new reinforcement learning algorithm, called Smoothed Dual Embedding Control or SDEC, is derived to solve the saddle-point reformulation with arbitrary learnable function approximator and compares favorably to the state-of-the-art baselines on several benchmark control problems.

Addressing Function Approximation Error in Actor-Critic Methods

This paper builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
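The clipped double-Q target described above can be sketched in a few lines (a hedged illustration; function and parameter names are not from the paper):

```python
import numpy as np

def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """TD3-style bootstrap target: take the minimum of two critic
    estimates at the next state to limit overestimation bias.
    `done` is 1.0 at terminal transitions, zeroing the bootstrap term."""
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * q_min
```

Both critics are then regressed toward this single shared target, so an overestimate by either critic alone cannot propagate through bootstrapping.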

A Closer Look at Deep Policy Gradients

A fine-grained analysis of state-of-the-art methods based on key elements of this framework: gradient estimation, value prediction, and optimization landscapes shows that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict.

Issues in Using Function Approximation for Reinforcement Learning

This paper gives a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail, and presents experimental results which support the theoretical findings.