# Independent Policy Gradient Methods for Competitive Reinforcement Learning

@article{Daskalakis2021IndependentPG, title={Independent Policy Gradient Methods for Competitive Reinforcement Learning}, author={Constantinos Daskalakis and Dylan J. Foster and Noah Golowich}, journal={ArXiv}, year={2021}, volume={abs/2101.04233} }

We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as…
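The idea of two players running policy gradient in tandem, each seeing only its own action and reward, can be sketched on a toy problem. The example below is a minimal illustration under several assumptions that are not taken from the paper: it uses a single-state zero-sum matrix game rather than an episodic stochastic game, a softmax policy with a REINFORCE-style update, and hypothetical names and step sizes (`independent_policy_gradient`, `lr_row`, `lr_col`). It makes no claim to reproduce the paper's algorithm or its convergence guarantee.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, probs, action, reward, lr):
    """One REINFORCE update using only the player's own action and reward.

    The gradient of log softmax(action) with respect to logit k is
    1[k == action] - probs[k].
    """
    return [
        theta + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, theta in enumerate(logits)
    ]


def independent_policy_gradient(payoff, episodes=2000, lr_row=0.05,
                                lr_col=0.05, seed=0):
    """Run both players' updates in tandem on a zero-sum matrix game.

    payoff[i][j] is the row player's reward; the column player gets the
    negative. Step sizes are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    n_row, n_col = len(payoff), len(payoff[0])
    theta_row = [0.0] * n_row
    theta_col = [0.0] * n_col
    for _ in range(episodes):
        p_row, p_col = softmax(theta_row), softmax(theta_col)
        # Each player samples from its own policy, independently.
        a_row = rng.choices(range(n_row), weights=p_row)[0]
        a_col = rng.choices(range(n_col), weights=p_col)[0]
        reward = payoff[a_row][a_col]
        # Neither update uses the opponent's action or policy.
        theta_row = reinforce_step(theta_row, p_row, a_row, reward, lr_row)
        theta_col = reinforce_step(theta_col, p_col, a_col, -reward, lr_col)
    return softmax(theta_row), softmax(theta_col)


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    row_policy, col_policy = independent_policy_gradient(matching_pennies)
    print("row policy:", row_policy)
    print("column policy:", col_policy)
```

Each player's update touches only its own logits, sampled action, and reward, which is the sense in which the learners are independent. Whether the iterates approach an equilibrium in this toy setting depends on the game and the step sizes chosen.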

## 75 Citations

### An Independent Learning Algorithm for a Class of Symmetric Stochastic Games

- Computer Science
- ArXiv
- 2021

This paper investigates the feasibility of using independent learners to find approximate equilibrium policies in non-episodic, discounted stochastic games, and presents an independent learning algorithm that comes with high probability guarantees of approximate equilibrium in this class of games.

### Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence

- Computer Science, Economics
- ICML
- 2022

To learn a Nash equilibrium of an MPG in which the size of state space and/or the number of players can be very large, new independent policy gradient algorithms are proposed that are run by all players in tandem.

### Gradient play in stochastic games: stationary points, convergence, and sample complexity

- Computer Science
- 2021

This work designs a sample-based reinforcement learning algorithm and gives a non-asymptotic global convergence rate analysis for both exact gradient play and the authors' sample-based learning algorithm for a subclass of SGs called Markov potential games.

### Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

- Computer Science
- ArXiv
- 2022

This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and proposes a single-loop policy optimization method with symmetric updates from both agents, which achieves a last-iterate linear convergence to the quantal response equilibrium of the regularized problem.

### On the Global Convergence Rates of Decentralized Softmax Gradient Play in Markov Potential Games

- Computer Science
- 2022

The established convergence rates for the unregularized cases contain a trajectory-dependent constant that can be arbitrarily large, whereas the log-barrier regularization overcomes this drawback, at the cost of slightly worse dependence on other factors such as the action set size.

### Empirical Policy Optimization for n-Player Markov Games

- Economics
- IEEE Transactions on Cybernetics
- 2022

This paper treats the evolution of player policies as a dynamical process and proposes a novel learning scheme for Nash equilibrium. The resulting empirical policy optimization algorithm is implemented in a reinforcement-learning framework and runs in a distributed way, with each player optimizing its policy based on its own observations.

### On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning

- Computer Science
- ICML
- 2022

This work investigates sample-efficient model-free algorithms in decentralized MARL, and proposes stage-based V-learning algorithms that simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-weighted-regret bandit subroutine.

### On the Effect of Log-Barrier Regularization in Decentralized Softmax Gradient Play in Multiagent Systems

- Computer Science
- ArXiv
- 2022

This paper studies the finite-time convergence of decentralized softmax gradient play in a special form of game, Markov Potential Games (MPGs), which includes the identical interest game as a special case, and introduces the log-barrier regularization to overcome these drawbacks.

### Independent Learning in Stochastic Games

- Computer Science
- ArXiv
- 2021

This review paper presents the recently proposed simple and independent learning dynamics that guarantee convergence in zero-sum stochastic games, together with a review of other contemporaneous algorithms for dynamic multi-agent learning in this setting.

### Satisficing Paths and Independent Multi-Agent Reinforcement Learning in Stochastic Games

- Economics
- 2021

In multi-agent reinforcement learning (MARL), independent learners are those that do not observe the actions of other agents in the system. Due to the decentralization of information, it is…

## References

### The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems

- Economics
- AAAI/IAAI
- 1998

This work distinguishes reinforcement learners that are unaware of (or ignore) the presence of other agents from those that explicitly attempt to learn the value of joint actions and the strategies of their counterparts, and proposes alternative optimistic exploration strategies that increase the likelihood of convergence to an optimal equilibrium.

### Policy-Gradient Algorithms Have No Guarantees of Convergence in Continuous Action and State Multi-Agent Settings

- Economics
- ArXiv
- 2019

It is shown by counterexample that policy-gradient algorithms have no guarantees of even local convergence to Nash equilibria in continuous action and state space multi-agent settings, and a large number of general-sum linear quadratic games are generated that satisfy these conditions.

### Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

- Computer Science
- NeurIPS
- 2018

This paper examines the role of policy gradient and actor-critic algorithms in partially-observable multiagent environments and relates them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees.

### Neural Replicator Dynamics

- Computer Science
- ArXiv
- 2019

An elegant one-line change to policy gradient methods is derived that simply bypasses the gradient step through the softmax, yielding a new algorithm titled Neural Replicator Dynamics (NeuRD), which quickly adapts to nonstationarities and outperforms policy gradient significantly in both tabular and function approximation settings.

### Global Optimality Guarantees For Policy Gradient Methods

- Computer Science
- ArXiv
- 2019

This work identifies structural properties, shared by finite MDPs and several classic control problems, which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex.

### Nash Q-Learning for General-Sum Stochastic Games

- Computer Science, Economics
- J. Mach. Learn. Res.
- 2003

This work extends Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games, and implements an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.

### Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Computer Science
- NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

### Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games

- Economics, Computer Science
- AAMAS
- 2020

It is shown by counterexample that policy-gradient algorithms have no guarantees of even local convergence to Nash equilibria in continuous action and state space multi-agent settings, and a large number of general-sum linear quadratic games are generated that satisfy these conditions.

### Proximal Policy Optimization Algorithms

- Computer Science
- ArXiv
- 2017

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…