• Corpus ID: 239049633

# Actor-critic is implicitly biased towards high entropy optimal policies

```bibtex
@article{Hu2021ActorcriticII,
  title   = {Actor-critic is implicitly biased towards high entropy optimal policies},
  author  = {Yuzheng Hu and Ziwei Ji and Matus Telgarsky},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.11280}
}
```
• Published 21 October 2021
• Computer Science
• ArXiv
We show that the simplest actor-critic method — a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration — does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ε-greedy, but is moreover trained on a single trajectory with no resets. The key consequence…
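To make the setup concrete, here is a minimal sketch of such an unregularized actor-critic loop. The MDP, step sizes, and the expected-SARSA-style TD(0) critic below are all illustrative assumptions, not the paper's exact algorithm; the paper's setting is a linear MDP trained on a single trajectory, of which the tabular case here is the special case with one-hot features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP, used only for illustration.
n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a] = next-state distribution; R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))  # actor: softmax logits (linear in one-hot features)
w = np.zeros((n_states, n_actions))      # critic: Q-value estimate

def policy(s):
    """Softmax policy over the logits for state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

alpha_critic, alpha_actor = 0.1, 0.1
s = 0
for t in range(20_000):  # a single trajectory: no resets, no ε-greedy mixing
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s2 = rng.choice(n_states, p=P[s, a])
    # Expected-SARSA-style TD(0) critic update (one reasonable stand-in
    # for the paper's TD critic, not its exact update).
    td_error = R[s, a] + gamma * (policy(s2) @ w[s2]) - w[s, a]
    w[s, a] += alpha_critic * td_error
    # Actor: softmax policy-gradient step using the critic's value.
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0  # d/d theta[s] of log pi(a | s)
    theta[s] += alpha_actor * w[s, a] * grad_log_pi
    s = s2
```

Note that nothing here regularizes or projects the iterates: exploration comes only from the stochasticity of the softmax policy itself, which is exactly the regime in which the paper studies the implicit bias.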
## Citations (1)
Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity
• Yan Li, Tuo Zhao
• Computer Science, Mathematics
• 2022
We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three…
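For contrast with the implicit (unregularized) bias studied in the paper above, a generic KL-regularized policy mirror descent step — of which HPMD is a refinement; its homotopy schedule on the regularization weight is omitted, and all numbers below are hypothetical — can be sketched as:

```python
import numpy as np

def pmd_step(pi, Q, eta):
    """One mirror descent update with KL divergence as the Bregman term:
    argmax_p <Q, p> - (1/eta) * KL(p || pi), i.e. pi_new ∝ pi * exp(eta * Q)."""
    logits = np.log(pi) + eta * Q
    z = np.exp(logits - logits.max())
    return z / z.sum()

pi = np.full(3, 1 / 3)           # uniform start over 3 actions in one state
Q = np.array([1.0, 1.0, 0.0])    # two equally optimal actions, one suboptimal
for _ in range(50):
    pi = pmd_step(pi, Q, eta=0.5)
# mass concentrates on the two optimal actions and, by symmetry, splits evenly
```

The even split between the two optimal actions echoes the high-entropy preference in the main paper, though here it arises from the symmetric multiplicative update rather than from an implicit bias of TD-driven actor-critic.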
