Corpus ID: 235422374

Characterizing the Gap Between Actor-Critic and Policy Gradient

Junfeng Wen, Saurabh Kumar, Ramki Gummadi, Dale Schuurmans
Actor-critic (AC) methods are ubiquitous in reinforcement learning. Although AC methods are understood to be closely related to policy gradient (PG), their precise connection has not previously been fully characterized. In this paper, we explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient that recovers the true policy gradient of the cumulative reward objective. Furthermore, by viewing the AC method as a two-player Stackelberg… 
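As a minimal illustration of the kind of gap the abstract refers to (this is a generic sketch on a one-state MDP, not the paper's exact construction): the true policy gradient weights the score function by the true action values, while an actor-critic update substitutes a learned critic, so the two updates differ by a term driven by the critic's error.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def policy_grad(logits, q):
    """Exact gradient of E_pi[q] w.r.t. softmax logits:
    d/dtheta_i E[q] = pi_i * (q_i - E_pi[q])."""
    pi = softmax(logits)
    baseline = pi @ q
    return pi * (q - baseline)

q_true = np.array([1.0, 0.0, -0.5])    # true action values (a bandit)
q_critic = np.array([0.8, 0.2, -0.4])  # an imperfect learned critic
logits = np.zeros(3)

g_pg = policy_grad(logits, q_true)     # true policy gradient
g_ac = policy_grad(logits, q_critic)   # actor-critic surrogate gradient
gap = g_pg - g_ac                      # vanishes when the critic is exact
```

Because the gradient is linear in the action values, the gap here equals the policy gradient computed with the critic's error `q_true - q_critic` in place of the values.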


Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation
This paper proposes the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting and establishes asymptotic convergence results for both the critic and the actor under Markovian sampling.
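The target-based critic update analyzed above can be sketched roughly as follows (an illustrative semi-gradient TD(0) loop with linear features and periodic hard target refreshes; the toy reward and all constants are assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = np.zeros(d)        # online critic weights
w_tgt = np.zeros(d)    # target weights, refreshed periodically
gamma, alpha = 0.9, 0.1

for t in range(200):
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    r = 0.1 * phi.sum()                           # toy reward signal
    # TD target is computed with the slowly-updated target weights
    td_error = r + gamma * (phi_next @ w_tgt) - phi @ w
    w += alpha * td_error * phi                   # semi-gradient TD(0) step
    if t % 20 == 0:                               # periodic hard target update
        w_tgt = w.copy()
```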
Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms
A meta-framework for Stackelberg actor-critic algorithms where the leader player follows the total derivative of its objective instead of the usual individual gradient is proposed.
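The distinction between the leader's total derivative and its individual gradient can be made concrete with a toy bilevel problem (illustrative functions chosen here, not the paper's framework): the follower's best response depends on the leader's variable, and the total derivative accounts for that dependence.

```python
# Follower: min_y (y - 2x)^2  =>  best response y*(x) = 2x.
# Leader:   min_x f(x, y) = (x - y)^2 + x^2.

def f_leader(x, y):
    return (x - y) ** 2 + x ** 2

def best_response(x):
    return 2.0 * x

def individual_grad(x, y):
    # partial derivative of f w.r.t. x, holding y fixed
    return 2.0 * (x - y) + 2.0 * x

def total_grad(x):
    # d/dx f(x, y*(x)) = partial_x f + (dy*/dx) * partial_y f
    y = best_response(x)
    dfdy = -2.0 * (x - y)
    return individual_grad(x, y) + 2.0 * dfdy
```

At `x = 1` the individual gradient evaluated at the follower's best response is 0, while the total derivative is 4: a leader following only its individual gradient would stop, whereas the Stackelberg leader keeps descending.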
A Parametric Class of Approximate Gradient Updates for Policy Optimization
This work identifies a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO, and develops a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function.
Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems
This paper unifies several SGD-type updates for stochastic nested problems into a single approach termed the ALternating Stochastic gradient dEscenT (ALSET) method, and presents a tighter analysis of ALSET for stochastic nested problems.
Reinforcement Learning for Personalized Drug Discovery and Design for Complex Diseases: A Systems Pharmacology Perspective
In this survey, state-of-the-art reinforcement learning methods and their latest applications to drug design are reviewed and the challenges on harnessing reinforcement learning for systems pharmacology and personalized medicine are discussed.


Bridging the Gap Between Value and Policy Based Reinforcement Learning
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
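The multi-step soft consistency error that PCL minimizes is commonly stated as C = -V(s_0) + gamma^d V(s_d) + sum_i gamma^i (r_i - tau * log pi(a_i|s_i)); a small sketch (function name and defaults are illustrative):

```python
def soft_consistency(v0, vd, rewards, logpis, gamma=0.99, tau=0.1):
    """Soft consistency error along a sub-trajectory of length d = len(rewards).
    PCL minimizes the square of this quantity over on- and off-policy traces."""
    d = len(rewards)
    discounted = sum(gamma ** i * (rewards[i] - tau * logpis[i]) for i in range(d))
    return -v0 + gamma ** d * vd + discounted
```

As a sanity check, with no discounting and no entropy term (gamma=1, tau=0), the error is zero exactly when V(s_0) equals the sum of rewards plus V(s_d).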
Natural Actor-Critic
Combining policy gradient and Q-learning
A new technique is described that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. It establishes an equivalence between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms.
Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning
A method is proposed for estimating the log stationary distribution derivative (LSD), a useful form of the derivative of the stationary state distribution, via a backward Markov chain formulation and a temporal-difference learning framework.
Equivalence Between Policy Gradients and Soft Q-Learning
There is a precise equivalence between Q-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning; in particular, "soft" Q-learning is exactly equivalent to a policy gradient method.
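The identities underlying this equivalence are the standard entropy-regularized relations V(s) = tau * log sum_a exp(Q(s,a)/tau) and pi(a|s) = exp((Q(s,a) - V(s))/tau), i.e. the soft-optimal policy is a softmax over Q at temperature tau (function names below are illustrative):

```python
import numpy as np

def soft_value(q, tau=1.0):
    """Soft (log-sum-exp) value: V = tau * log sum exp(Q / tau),
    computed stably with the max-shift trick."""
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())

def soft_policy(q, tau=1.0):
    """Boltzmann policy pi(a) = exp((Q(a) - V) / tau)."""
    v = soft_value(q, tau)
    return np.exp((q - v) / tau)
```

At tau = 1 this reduces to an ordinary softmax over Q, and as tau goes to 0 the soft value approaches the hard maximum.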
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Dual Representations for Dynamic Programming and Reinforcement Learning
This paper presents a modified dual of the standard linear program that guarantees a globally normalized state visit distribution, and derives novel dual forms of dynamic programming, including policy evaluation, policy iteration, and value iteration, as well as new dual forms of Sarsa and Q-learning.
Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution
It is shown that the Beta policy is bias-free and provides significantly faster convergence and higher scores than the Gaussian policy when both are used with trust region policy optimization and actor-critic with experience replay (the state-of-the-art on- and off-policy stochastic methods, respectively) on OpenAI Gym and MuJoCo continuous control environments.
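The reason the Beta policy avoids boundary bias can be sketched simply: a Beta(alpha, beta) distribution has support exactly on [0, 1], so samples can be affinely mapped to a bounded action range without the clipping that biases a Gaussian policy near the limits (bounds and parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_action(alpha, beta, lo=-2.0, hi=2.0, size=1000):
    u = rng.beta(alpha, beta, size=size)   # support in (0, 1)
    return lo + (hi - lo) * u              # mapped into [lo, hi], no clipping

actions = sample_beta_action(2.0, 2.0)
```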
Addressing Function Approximation Error in Actor-Critic Methods
This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
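The clipped double-Q target described above amounts to bootstrapping from the pessimistic minimum of two target critics, y = r + gamma * min(Q1', Q2'); a minimal sketch (function name and constants are illustrative):

```python
def clipped_double_q_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Bellman target using the minimum of two target-critic estimates
    to curb overestimation; no bootstrap on terminal transitions."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return r + bootstrap
```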