Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games

@article{Zeng2022RegularizedGD,
title={Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games},
author={Sihan Zeng and Thinh T. Doan and Justin K. Romberg},
journal={ArXiv},
year={2022},
volume={abs/2205.13746}
}
• Published 27 May 2022
• Computer Science, Mathematics
• ArXiv
We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization problem, a natural approach is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understanding of this method is limited. In our paper, we consider solving an entropy-regularized variant of the…
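The descent/ascent idea the abstract describes can be sketched on a two-player zero-sum matrix game, a one-state simplification of a Markov game. This is an illustrative sketch of entropy-regularized alternating gradient descent-ascent, not the paper's algorithm; the payoff matrix, temperature `tau`, step size, and initial logits are arbitrary choices for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Matching-pennies payoff matrix; the row player minimizes x^T A y,
# the column player maximizes it.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

tau = 0.5   # entropy-regularization temperature (illustrative choice)
lr = 0.05   # step size (illustrative choice)
theta_x = np.array([1.0, 0.0])    # logits of the min player's policy
theta_y = np.array([-0.5, 0.5])   # logits of the max player's policy

for _ in range(5000):
    x, y = softmax(theta_x), softmax(theta_y)
    # Regularized objective: f(x, y) = x^T A y - tau*H(x) + tau*H(y),
    # where H is Shannon entropy; the entropy terms make f strongly
    # convex in x and strongly concave in y on the simplex.
    # Gradient w.r.t. x, pulled back through the softmax Jacobian:
    vx = A @ y + tau * (np.log(x) + 1.0)
    theta_x -= lr * x * (vx - x @ vx)   # descent step for the min player
    x = softmax(theta_x)                # alternating update: refresh x
    vy = A.T @ x - tau * (np.log(y) + 1.0)
    theta_y += lr * y * (vy - y @ vy)   # ascent step for the max player

x, y = softmax(theta_x), softmax(theta_y)
# By symmetry, the regularized (quantal response) equilibrium of this
# game is uniform play for both players.
```

With the entropy terms in place the iterates spiral into the unique regularized equilibrium rather than cycling, which is the qualitative behavior that motivates studying the regularized variant.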

