Corpus ID: 239049824

Independent Natural Policy Gradient Always Converges in Markov Potential Games

@inproceedings{Fox2022IndependentNP,
  title={Independent Natural Policy Gradient Always Converges in Markov Potential Games},
  author={Roy Fox and Stephen McAleer and William H. Overman and Ioannis Panageas},
  booktitle={AISTATS},
  year={2022}
}
Natural policy gradient has emerged as one of the most successful algorithms for computing optimal policies in challenging Reinforcement Learning (RL) tasks, but very little was known about its convergence properties until recently. The picture becomes more blurry when it comes to multi-agent RL (MARL), where only a few works have theoretical guarantees for convergence to Nash policies. In this paper, we focus on a particular class of multi-agent stochastic games called Markov Potential Games… 
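For intuition about the object of study (a hedged sketch, not the paper's own algorithm or analysis): with tabular softmax policies, a natural policy gradient step reduces to a multiplicative-weights update on each agent's own expected payoffs, and "independent" means every agent applies the update using only its own payoff signal. The Python sketch below assumes a two-player identical-interest matrix game, the simplest single-state Markov potential game; the payoff matrix, step size, and iteration count are arbitrary illustrative choices.

import numpy as np

# Hedged sketch: independent natural policy gradient (NPG) on a two-player
# identical-interest matrix game, i.e. a single-state Markov potential game.
# With tabular softmax policies the NPG step is a multiplicative-weights update
# on each player's expected payoffs. All constants are illustrative.

rng = np.random.default_rng(0)
R = rng.normal(size=(3, 3))      # shared payoff: both players receive R[a1, a2]

pi1 = np.ones(3) / 3             # player 1's mixed strategy
pi2 = np.ones(3) / 3             # player 2's mixed strategy
eta = 0.5                        # step size

for _ in range(200):
    q1 = R @ pi2                 # player 1's expected payoff per action vs. current pi2
    q2 = R.T @ pi1               # player 2's expected payoff per action vs. current pi1
    # Independent NPG step: exponentiated (multiplicative-weights) update,
    # each player using only its own payoff vector.
    pi1 = pi1 * np.exp(eta * q1)
    pi1 /= pi1.sum()
    pi2 = pi2 * np.exp(eta * q2)
    pi2 /= pi2.sum()

print("player 1 policy:", np.round(pi1, 3))
print("player 2 policy:", np.round(pi2, 3))
print("joint value:", float(pi1 @ R @ pi2))

In the full Markov-game setting the analogous update is applied per state, with estimated Q-values in place of the matrix payoffs; the paper's convergence guarantee concerns that richer setting.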

Citations

Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence
TLDR
To learn a Nash equilibrium of an MPG in which the size of the state space and/or the number of players can be very large, new independent policy gradient algorithms are proposed that are run by all players in tandem.
On the Effect of Log-Barrier Regularization in Decentralized Softmax Gradient Play in Multiagent Systems
TLDR
This paper studies the finite-time convergence of decentralized softmax gradient play in a special class of games, Markov Potential Games (MPGs), which includes identical-interest games as a special case, and introduces log-barrier regularization to overcome the drawbacks of the unregularized dynamics.
Independent Natural Policy Gradient Methods for Potential Games: Finite-time Global Convergence with Entropy Regularization
TLDR
The proposed entropy-regularized NPG method enables each agent to deploy symmetric, decentralized, and multiplicative updates according to its own payoff, and it is shown that the proposed method converges to the quantal response equilibrium (QRE), the equilibrium of the entropy-regularized game, at a sublinear rate. (A hedged sketch of this kind of multiplicative update appears after this list of citing papers.)
Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games
TLDR
This paper shows the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies and derives the finite-time performance of the policy gradient in two settings: using the log-barrier regularization and using the natural policy gradient under the best-response dynamics (NPG-BR).
Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games
TLDR
Self-Play PSRO (SP-PSRO) is introduced, a method that adds an approximately optimal stochastic policy to the population in each iteration and empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning
TLDR
This paper investigates sample-efficient model-free algorithms in decentralized MARL, and proposes stage-based V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-weighted-regret bandit subroutine.
Logit-Q Learning in Markov Games
We present new independent learning dynamics provably converging to an efficient equilibrium (also known as optimal equilibrium) maximizing the social welfare in infinite-horizon discounted…
Learning in Congestion Games with Bandit Feedback
TLDR
This paper investigates congestion games, a class of games with benign theoretical structure and broad real-world applications, and proposes a centralized algorithm based on the optimism in the face of uncertainty principle and a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design.
Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure
TLDR
This work considers two-agent multi-armed bandits and Markov decision processes with a hierarchical information structure arising in applications, and proposes simpler and more efficient algorithms that require no coordination or communication.
ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret
TLDR
This paper proposes an unbiased model-free method, ESCHER, that is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case, and shows that a deep learning version of ESCHER outperforms the prior state of the art, DREAM and neural fictitious self-play (NFSP), and the difference becomes dramatic as game size increases.
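Several of the citing papers above describe symmetric, decentralized, multiplicative updates; the entropy-regularized NPG entry is one example. As a hedged sketch only, one common form such an entropy-regularized step takes in the single-state (matrix potential game) case is

\pi_i^{(t+1)}(a) \;\propto\; \big(\pi_i^{(t)}(a)\big)^{1-\eta\tau} \exp\!\big(\eta\, q_i^{(t)}(a)\big),
\qquad
q_i^{(t)}(a) \;=\; \mathbb{E}_{a_{-i}\sim \pi_{-i}^{(t)}}\!\big[ r_i(a, a_{-i}) \big],

where the step size \eta, entropy weight \tau, and payoff vector q_i are notation introduced here for illustration; the exact update and constants in the cited work may differ. With \tau = 0 this reduces to the unregularized multiplicative-weights form of independent NPG sketched after the abstract.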

References

SHOWING 1-10 OF 38 REFERENCES
Independent Policy Gradient Methods for Competitive Reinforcement Learning
TLDR
It is shown that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
Gradient Play in Multi-Agent Markov Stochastic Games: Stationary Points and Convergence
TLDR
For Markov potential games, it is proved that strict NEs are local maxima of the total potential function and fully-mixed NEs are saddle points, and a local convergence rate around strict NEs for more general settings is given.
Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games
TLDR
A novel definition of Markov Potential Games (MPG) is presented that generalizes prior attempts at capturing complex stateful multi-agent coordination, and convergence of independent policy gradient to Nash policies (polynomially fast in the approximation error) is proved by adapting recent gradient-dominance arguments developed for single-agent MDPs to multi-agent learning settings.
Neural Replicator Dynamics
TLDR
An elegant one-line change to policy gradient methods is derived that simply bypasses the gradient step through the softmax, yielding a new algorithm titled Neural Replicator Dynamics (NeuRD), which quickly adapts to nonstationarities and outperforms policy gradient significantly in both tabular and function approximation settings. (A hedged sketch of this change appears after the reference list.)
Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes
TLDR
One insight of this work is in formalizing the importance of how a favorable initial state distribution provides a means to circumvent worst-case exploration issues, analogous to the global convergence guarantees of iterative value-function-based algorithms.
A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
TLDR
An algorithm is described, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection, which generalizes previous algorithms such as InRL.
XDO: A Double Oracle Algorithm for Extensive-Form Games
TLDR
Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates, is proposed and Neural XDO (NXDO) is introduced, where the best response is learned through deep RL.
Learning in Nonzero-Sum Stochastic Games with Potentials
TLDR
This paper introduces a new generation of MARL learners that can handle nonzero-sum payoff structures and continuous settings, and proves theoretically that the proposed learning method, SPot-AC, enables independent agents to learn Nash equilibrium strategies in polynomial time.
Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
TLDR
P2SRO, the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games, is introduced; it achieves state-of-the-art performance on Barrage Stratego and beats all existing bots.
Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
TLDR
Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art in MARL.
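The Neural Replicator Dynamics reference above describes a one-line change that bypasses the gradient step through the softmax. The hedged sketch below contrasts the logit update produced by all-actions softmax policy gradient with a replicator-dynamics-style update that applies the advantages to the logits directly; the tabular setting, advantage vector, and step size are assumptions made here for illustration, not details taken from that paper.

import numpy as np

# Hedged sketch: compare the logit update of all-actions softmax policy gradient
# with a NeuRD-style update that bypasses the softmax Jacobian. The advantage
# estimates and step size are illustrative.

def softmax(y):
    z = np.exp(y - y.max())
    return z / z.sum()

y = np.zeros(4)                          # logits for 4 actions
adv = np.array([1.0, -0.5, 0.2, -0.7])   # per-action advantage estimates (illustrative)
eta = 0.1
pi = softmax(y)

# All-actions softmax policy gradient of E_pi[adv] with respect to the logits:
# component b is pi[b] * (adv[b] - pi @ adv), i.e. each change is scaled by pi[b].
pg_step = eta * pi * (adv - pi @ adv)

# NeuRD-style step: apply the advantages to the logits directly,
# dropping the pi[b] scaling that comes from differentiating through the softmax.
neurd_step = eta * adv

print("policy-gradient logit step:", np.round(pg_step, 4))
print("NeuRD-style logit step:    ", np.round(neurd_step, 4))

The contrast is the missing pi factor: the policy-gradient step scales each logit's change by the current action probability, while the replicator-style step does not, which is the sense in which the softmax Jacobian is bypassed.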