Corpus ID: 239049633

Actor-critic is implicitly biased towards high entropy optimal policies

@article{Hu2021ActorcriticII,
  title={Actor-critic is implicitly biased towards high entropy optimal policies},
  author={Yuzheng Hu and Ziwei Ji and Matus Telgarsky},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.11280}
}
We show that the simplest actor-critic method — a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration — does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ε-greedy, but is moreover trained on a single trajectory with no resets. The key consequence…
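The setup named in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the paper's exact algorithm or analysis: the random MDP, the one-hot state-action features (a trivial special case of a linear MDP), the step sizes, and the use of the estimated Q-value in the actor update are assumptions made here for the example. It keeps the ingredients listed above: a linear softmax actor, a linear TD(0) critic, a single trajectory with no resets, and no regularization, projections, or ε-greedy exploration.

```python
# Minimal sketch (illustrative assumptions throughout): linear softmax actor,
# linear TD(0) critic, one continuing trajectory, no regularization or exploration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP; one-hot state-action features make this a (trivial) linear MDP.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a]: next-state distribution
R = rng.uniform(size=(n_states, n_actions))                       # deterministic rewards

def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

theta = np.zeros(n_states * n_actions)  # actor: softmax policy with logits theta . phi(s, a)
w = np.zeros(n_states * n_actions)      # critic: linear Q estimate w . phi(s, a)
alpha_actor, alpha_critic = 0.05, 0.1   # illustrative step sizes

def policy(s):
    logits = np.array([theta @ phi(s, a) for a in range(n_actions)])
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

s = 0
a = rng.choice(n_actions, p=policy(s))
for t in range(20000):
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=policy(s_next))  # on-policy sample, no ε-greedy

    # Critic: TD(0) update of the linear Q estimate along the single trajectory.
    td_error = r + gamma * (w @ phi(s_next, a_next)) - w @ phi(s, a)
    w += alpha_critic * td_error * phi(s, a)

    # Actor: softmax policy-gradient step using the critic's Q value,
    # with no entropy bonus or other explicit regularization.
    pi = policy(s)
    grad_log = phi(s, a) - sum(pi[b] * phi(s, b) for b in range(n_actions))
    theta += alpha_actor * (w @ phi(s, a)) * grad_log

    s, a = s_next, a_next               # continue the same trajectory, no reset
```

In this sketch the only stochasticity in action selection comes from the softmax policy itself, which is the regime in which the paper argues the method is implicitly biased toward high entropy optimal policies.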
Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity
  • Yan Li, Tuo Zhao, Guanghui Lan
  • Computer Science, Mathematics
  • 2022
We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three …

References

SHOWING 1-10 OF 25 REFERENCES
A Theory of Regularized Markov Decision Processes
TLDR
A general theory of regularized Markov Decision Processes that generalizes prior entropy- and KL-regularized approaches in two directions: a larger class of regularizers, and a general modified policy iteration scheme encompassing both policy iteration and value iteration.
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
TLDR
This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis, and proves fast rates of $\tilde O(1/N)$, much like results in convex optimization.
Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator
TLDR
This work bridges the gap showing that (model free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem dependent quantities) with regards to their sample and computational complexities.
Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima.
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, …
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Maximum a Posteriori Policy Optimisation
TLDR
This work introduces a new algorithm for reinforcement learning called Maximum aposteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective and develops two off-policy algorithms that are competitive with the state-of-the-art in deep reinforcement learning.
A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation
TLDR
Finite time convergence rates for TD learning with linear function approximation are proved and the authors provide results for the case when TD is applied to a single Markovian data stream where the algorithm's updates can be severely biased.
How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
TLDR
This work establishes sharp optimization and generalization guarantees for deep ReLU networks under various assumptions made in previous work, and shows that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices.
The Implicit Bias of Gradient Descent on Separable Data
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution; a formal statement of this result is sketched after this list.
Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences
This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the ε-greedy exploration parameter according to observed temporal-difference (value) errors.
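For reference, the implicit-bias result summarized in the entry on gradient descent with separable data above can be written as follows; the notation ($w_t$, $x_i$, $y_i$) is ours and this is the standard form of the statement, not a quotation from that paper:

$$\lim_{t\to\infty} \frac{w_t}{\lVert w_t \rVert} \;=\; \frac{\hat w}{\lVert \hat w \rVert}, \qquad \hat w \;=\; \operatorname*{arg\,min}_{w} \lVert w \rVert_2 \quad \text{subject to } y_i\, w^\top x_i \ge 1 \text{ for all } i,$$

where $w_t$ are the gradient-descent iterates on the unregularized logistic loss over a linearly separable dataset $\{(x_i, y_i)\}$.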