Corpus ID: 226227358

Finding the Near Optimal Policy via Adaptive Reduced Regularization in MDPs

@article{Yang2020FindingTN,
  title={Finding the Near Optimal Policy via Adaptive Reduced Regularization in MDPs},
  author={Wenhao Yang and Xiang Li and Guangzeng Xie and Zhihua Zhang},
  journal={ArXiv},
  year={2020},
  volume={abs/2011.00213}
}
Regularized MDPs serve as a smoothed version of the original MDPs. However, the optimal policy of a regularized MDP is always biased with respect to that of the original MDP. Instead of making the coefficient λ of the regularization term sufficiently small, we propose an adaptive reduction scheme for λ to approximate the optimal policy of the original MDP. It is shown that the iteration complexity for obtaining an ε-optimal policy can be reduced in comparison with setting a sufficiently small λ. In addition, there exists…
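
As a rough illustration of the reduction idea, here is a minimal Python sketch assuming entropy regularization on a tabular MDP solved by soft value iteration; the geometric halving schedule for λ, the warm-started value function, the bias bound λ·log|A|/(1-γ), and the function names are assumptions made for this sketch, not the paper's exact algorithm or rates.

import numpy as np

def soft_value_iteration(P, R, gamma, lam, V0, tol):
    # Entropy-regularized ("soft") value iteration for a tabular MDP.
    # P has shape (S, A, S), R has shape (S, A); lam is the entropy weight.
    V = V0.copy()
    while True:
        Q = R + gamma * (P @ V)                      # soft Q-values, shape (S, A)
        Qmax = Q.max(axis=1, keepdims=True)          # stabilize the log-sum-exp
        V_new = (Qmax + lam * np.log(
            np.exp((Q - Qmax) / lam).sum(axis=1, keepdims=True))).ravel()
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    pi = np.exp((Q - V[:, None]) / lam)              # softmax (Boltzmann) policy
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi

def adaptive_lambda_reduction(P, R, gamma, eps, lam0=1.0):
    # Illustrative adaptive scheme: halve lam each stage and warm-start the value
    # function, until the regularization bias bound lam*log|A|/(1-gamma) <= eps/2.
    S, A = R.shape
    V, lam = np.zeros(S), lam0
    while lam * np.log(A) / (1.0 - gamma) > eps / 2.0:
        V, _ = soft_value_iteration(P, R, gamma, lam, V, tol=eps / 4.0)
        lam /= 2.0
    return soft_value_iteration(P, R, gamma, lam, V, tol=eps / 4.0)
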
4 Citations

Softmax Policy Gradient Methods Can Take Exponential Time to Converge

It is demonstrated that softmax PG methods can take exponential time to converge, even with a benign policy initialization and an initial state distribution amenable to exploration; this exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization to accelerate PG methods.

Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm

It is proved that entropy regularization and averaging ensure stability by providing near-deterministic and strictly suboptimal policies, and that regularization leads to sharp sample-complexity and network-width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization.

The Power of Regularization in Solving Extensive-Form Games

This paper proposes a series of new algorithms based on regularizing the payoff functions of the game, and establishes a set of convergence results that strictly improve over the existing ones, with either weaker assumptions or stronger convergence guarantees.

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization

Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, this work develops provably efficient extragradient methods that find the quantal response equilibrium (QRE), the solution of entropy-regularized zero-sum two-player matrix games, at a linear rate.
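
As a rough sketch of the general technique (not the cited paper's exact updates or rates), an entropy-regularized extragradient / mirror-descent iteration for the QRE of a zero-sum matrix game might look as follows; the payoff matrix A, temperature tau, and step size eta are illustrative parameters chosen here for clarity.

import numpy as np

def _normalized_softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def qre_extragradient(A, tau, eta, iters=1000):
    # Entropy-regularized extragradient for min_x max_y x^T A y + tau*h(x) - tau*h(y),
    # where h is negative entropy; the fixed point is the quantal response equilibrium.
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    for _ in range(iters):
        # extrapolation (midpoint) step, using gradients evaluated at (x, y)
        x_mid = _normalized_softmax((1 - eta * tau) * np.log(x) - eta * (A @ y))
        y_mid = _normalized_softmax((1 - eta * tau) * np.log(y) + eta * (A.T @ x))
        # update step, using gradients evaluated at the midpoint (x_mid, y_mid)
        x = _normalized_softmax((1 - eta * tau) * np.log(x) - eta * (A @ y_mid))
        y = _normalized_softmax((1 - eta * tau) * np.log(y) + eta * (A.T @ x_mid))
    return x, y
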

References

Showing 1-10 of 33 references

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

A generic method to devise regularization forms and off-policy actor-critic algorithms for complex environment settings is provided, and a full mathematical analysis of the proposed regularized MDPs is conducted.

On the Convergence of Approximate and Regularized Policy Iteration Schemes

This paper proposes an optimality-preserving regularized modified policy iteration (MPI) scheme that provides intermediate policies with desirable properties, such as targeted exploration, while guaranteeing convergence to the optimal policy at explicit rates depending on the decrease rate of the regularization parameter.

A unified view of entropy-regularized Markov decision processes

A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
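
For reference, the discounted-MDP analogue of the optimality conditions that such dual problems resemble is the standard soft Bellman equation, stated here with temperature τ (a textbook form, not taken from the cited average-reward formulation):

V^{*}_{\tau}(s) = \tau \log \sum_{a \in \mathcal{A}} \exp\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V^{*}_{\tau}(s') \right]}{\tau} \right),
\qquad
\pi^{*}_{\tau}(a \mid s) \propto \exp\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V^{*}_{\tau}(s') \right]}{\tau} \right).
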

A Theory of Regularized Markov Decision Processes

A general theory of regularized Markov Decision Processes is developed, generalizing previous approaches in two directions: a larger class of regularizers, and a general modified policy iteration scheme encompassing both policy iteration and value iteration.

Global Optimality Guarantees For Policy Gradient Methods

This work identifies structural properties, shared by finite MDPs and several classic control problems, which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex.

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis, and proves fast rates of Õ(1/N), much like results in convex optimization.

Dynamic policy programming

The finite-iteration and asymptotic ℓ∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error are proved, and the results suggest that DPP can achieve better performance than AVI and API since it averages out the simulation noise caused by Monte Carlo sampling throughout the learning process.

On the Global Convergence Rates of Softmax Policy Gradient Methods

It is shown that with the true gradient, policy gradient with a softmax parametrization converges at an O(1/t) rate, with constants depending on the problem and initialization, which significantly expands the recent asymptotic convergence results.
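
For context, the exact gradient these rates refer to in the tabular softmax case is the standard policy gradient expression from the literature, with θ the logits, d^{π}_{μ} the discounted state-visitation distribution, and A^{π} the advantage function:

\frac{\partial V^{\pi_\theta}(\mu)}{\partial \theta_{s,a}}
  = \frac{1}{1-\gamma}\, d^{\pi_\theta}_{\mu}(s)\, \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a).
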

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.

Understanding the impact of entropy on policy optimization

New tools for understanding the optimization landscape are presented, it is shown that policy entropy serves as a regularizer, and the challenge of designing general-purpose policy optimization algorithms is highlighted.