# Actor-critic is implicitly biased towards high entropy optimal policies

@article{Hu2021ActorcriticII, title={Actor-critic is implicitly biased towards high entropy optimal policies}, author={Yuzheng Hu and Ziwei Ji and Matus Telgarsky}, journal={ArXiv}, year={2021}, volume={abs/2110.11280} }

We show that the simplest actor-critic method — a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration — does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ε-greedy, but is moreover trained on a single trajectory with no resets. The key consequence…
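The setting described in the abstract can be illustrated with a minimal sketch: a softmax policy and a TD(0) critic, updated along a single trajectory with no resets, no regularization, no projections, and no explicit exploration. The tiny MDP below (tabular features are a special case of linear features) and all constants are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP (not from the paper).
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a] = next-state distribution
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # R[s, a]

theta = np.zeros((n_states, n_actions))   # softmax policy logits (actor)
w = np.zeros(n_states)                    # tabular value estimate (critic)
alpha_w, alpha_theta = 0.05, 0.05

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(20000):                    # single trajectory, no resets
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    # TD(0) critic update, then an actor step along the policy gradient,
    # using the TD error as an advantage estimate.
    delta = R[s, a] + gamma * w[s_next] - w[s]
    w[s] += alpha_w * delta
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log
    s = s_next
```

Note that nothing here forces exploration or bounds the logits; any preference among optimal policies the iterates develop is purely implicit, which is the phenomenon the paper analyzes.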

## One Citation

Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

- Computer Science, Mathematics
- 2022

We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three…

## References

Showing 1–10 of 25 references

A Theory of Regularized Markov Decision Processes

- Computer Science, Mathematics, ICML
- 2019

A general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: a larger class of regularizers, and the general modified policy iteration approach, encompassing both policy iteration and value iteration.

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

- Computer Science, Mathematics, AAAI
- 2020

This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis, and proves fast rates of $\tilde O(1/N)$, much like results in convex optimization.

Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator

- Computer Science, Mathematics, ICML
- 2018

This work bridges the gap showing that (model free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem dependent quantities) with regards to their sample and computational complexities.

Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima

- Mathematics
- 2019

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function…

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Mathematics, Computer Science, NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Maximum a Posteriori Policy Optimisation

- Computer Science, Mathematics, ICLR
- 2018

This work introduces a new algorithm for reinforcement learning called Maximum a posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective and develops two off-policy algorithms that are competitive with the state-of-the-art in deep reinforcement learning.

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

- Computer Science, Mathematics, COLT
- 2018

Finite time convergence rates for TD learning with linear function approximation are proved and the authors provide results for the case when TD is applied to a single Markovian data stream where the algorithm's updates can be severely biased.
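The "single Markovian data stream" setting that reference analyzes is the semi-gradient TD(0) update with linear features. A minimal sketch, with a toy chain, random features, and step size that are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative policy-evaluation problem: 5 states, uniform transitions,
# 2-d random features. None of these choices come from the cited paper.
n_states, gamma = 5, 0.9
phi = rng.standard_normal((n_states, 2))           # feature map phi(s)
P = np.full((n_states, n_states), 1.0 / n_states)  # transition matrix
r = rng.standard_normal(n_states)                  # reward per state

w = np.zeros(2)
alpha = 0.01
s = 0
for t in range(50000):        # one Markovian stream, no i.i.d. resampling
    s_next = rng.choice(n_states, p=P[s])
    # Semi-gradient TD(0): bootstrap the target with the current weights.
    delta = r[s] + gamma * phi[s_next] @ w - phi[s] @ w
    w += alpha * delta * phi[s]
    s = s_next
```

The bias the paper studies comes from consecutive samples being correlated through the chain, unlike the i.i.d. sampling assumed in simpler analyses.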

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

- Computer Science, Mathematics, ICLR
- 2021

This work establishes sharp optimization and generalization guarantees for deep ReLU networks under various assumptions made in previous work, and shows that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices.

The Implicit Bias of Gradient Descent on Separable Data

- Mathematics, Computer Science, J. Mach. Learn. Res.
- 2018

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the…
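That implicit-bias result can be seen numerically: on separable data, the norm of the unregularized logistic-regression iterate grows without bound while its direction settles on the max-margin separator. The dataset below is a hand-picked symmetric example, chosen so the max-margin direction is known in closed form.

```python
import numpy as np

# Four separable points, symmetric about the line x1 = x2, so the
# max-margin (hard-SVM) direction is exactly [1, 1] / sqrt(2).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.5
for t in range(20000):
    margins = y * (X @ w)
    # Gradient of the unregularized logistic loss sum_i log(1 + e^{-m_i}).
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).sum(axis=0)
    w -= lr * grad

direction = w / np.linalg.norm(w)   # converges to the max-margin direction
```

The loss has no finite minimizer here, so ||w|| keeps growing (roughly logarithmically in the iteration count), yet the normalized predictor stabilizes, which is the paper's point.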

Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences

- 2010

This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the…
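The idea of value-difference-based exploration can be sketched on a 2-armed bandit: ε is nudged toward a quantity that is large when TD errors are large (the agent is still surprised) and small once values have settled. The mapping, constants, and bandit below are simplified illustrative assumptions; VDBE's actual rule is state-dependent and specified in the paper.

```python
import math
import random

random.seed(0)

# Toy 2-armed bandit (illustrative, not from the paper).
true_means = [0.3, 0.7]
q = [0.0, 0.0]
alpha, sigma, mix = 0.1, 1.0, 0.1   # step size, sensitivity, mixing rate
eps = 1.0                            # start fully exploratory

for t in range(5000):
    a = random.randrange(2) if random.random() < eps else q.index(max(q))
    reward = true_means[a] + random.gauss(0.0, 0.1)
    td_error = reward - q[a]
    q[a] += alpha * td_error
    # Map the magnitude of the value change to a target in [0, 1):
    # large surprises push eps up, small ones let it decay.
    x = math.exp(-abs(alpha * td_error) / sigma)
    f = (1 - x) / (1 + x)
    eps = mix * f + (1 - mix) * eps
```

Unlike a fixed or time-decayed ε schedule, this adapts exploration to how much the value estimates are still moving.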