# Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

```bibtex
@article{Qiu2022ProvablyEF,
  title={Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions},
  author={Shuang Qiu and Xiaohan Wei and Jieping Ye and Zhaoran Wang and Zhuoran Yang},
  journal={ArXiv},
  year={2022},
  volume={abs/2207.12463}
}
```
• Published 25 July 2022
• Computer Science
• ArXiv
While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures…
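As context for the abstract above, classical discrete-time fictitious play has each player best-respond to the opponent's empirical action frequencies. A minimal sketch on a two-player zero-sum matrix game follows (the paper itself treats Markov games with structured, unknown transitions; the function names and uniform pseudo-count initialization here are illustrative assumptions, not the paper's algorithm):

```python
def best_response(payoff, opp_counts):
    """Pure best response to the opponent's empirical mixed strategy."""
    total = sum(opp_counts)
    expected = [
        sum(row[j] * opp_counts[j] for j in range(len(opp_counts))) / total
        for row in payoff
    ]
    return max(range(len(expected)), key=expected.__getitem__)

def fictitious_play(payoff, rounds=5000):
    """Discrete-time fictitious play; returns both empirical strategies."""
    n, m = len(payoff), len(payoff[0])
    # The column player minimizes `payoff`, i.e. maximizes its negated transpose.
    col_payoff = [[-payoff[i][j] for i in range(n)] for j in range(m)]
    row_counts, col_counts = [1] * n, [1] * m  # uniform pseudo-counts
    for _ in range(rounds):
        i = best_response(payoff, col_counts)      # row best-responds to column's history
        j = best_response(col_payoff, row_counts)  # column best-responds to row's history
        row_counts[i] += 1
        col_counts[j] += 1
    return ([c / sum(row_counts) for c in row_counts],
            [c / sum(col_counts) for c in col_counts])
```

On the matching-pennies payoff `[[1, -1], [-1, 1]]`, both empirical frequencies drift toward the uniform equilibrium (1/2, 1/2), the behavior Robinson's theorem guarantees for zero-sum games.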

## Citations

• Computer Science
ArXiv
• 2022
This work investigates the global convergence of natural policy gradient algorithms in multi-agent learning, and proposes variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees.
• Computer Science
ArXiv
• 2023
This paper verifies the existence of ZD strategies for the defender and investigates the performance of the defender's ZD strategy against a boundedly rational attacker, in comparison with the SSE strategy.
We consider a subclass of n -player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players’
• Economics
EC
• 2022
Certain but important classes of strategic-form games, including zero-sum and identical-interest games, have the fictitious-play-property (FPP), i.e., beliefs formed in fictitious play dynamics always

## References

Showing 1–10 of 59 references

• Computer Science
COLT
• 2020
This work develops provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves and proposes an optimistic variant of the least-squares minimax value iteration algorithm.
• Computer Science
NIPS
• 2017
The UCSG algorithm is proposed that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent, and this result improves previous ones under the same setting.
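The "game value" that regret is measured against in results like the one above is the minimax value of the zero-sum game. As an illustrative aside (not code from the cited paper), the value of a 2×2 zero-sum matrix game has a standard closed form:

```python
def zero_sum_2x2_value(A):
    """Minimax value of a 2x2 zero-sum game; the row player maximizes."""
    (a, b), (c, d) = A
    # Pure saddle point: row's security level meets column's.
    row_security = max(min(a, b), min(c, d))
    col_security = min(max(a, c), max(b, d))
    if row_security == col_security:
        return row_security
    # No saddle point: both players mix, and the classical formula applies.
    return (a * d - b * c) / (a - b - c + d)
```

For matching pennies `[[1, -1], [-1, 1]]` the value is 0, while `[[2, 1], [0, -1]]` has a pure saddle point of value 1.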
• Computer Science
ICML
• 2020
This work introduces a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and shows that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, and introduces an explore-then-exploit style algorithm, which achieves a slightly worse regret, but is guaranteed to run in polynomial time even in the worst case.
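Upper/lower confidence bound constructions like the one in the entry above rest on concentration bonuses of the form range·sqrt(log(1/δ)/n). A minimal, generic Hoeffding-interval sketch (illustrative only; the function name and the simulated Bernoulli rewards are assumptions, not the cited algorithm):

```python
import math
import random

def hoeffding_interval(samples, value_range, delta=0.001):
    """Two-sided Hoeffding confidence interval for a bounded-mean estimate.

    With probability at least 1 - delta, the true mean lies in [lo, hi].
    """
    n = len(samples)
    mean = sum(samples) / n
    bonus = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - bonus, mean + bonus

# Illustrative check on simulated Bernoulli(0.3) rewards.
random.seed(0)
rewards = [1.0 if random.random() < 0.3 else 0.0 for _ in range(2000)]
lo, hi = hoeffding_interval(rewards, value_range=1.0)
```

Optimistic value iteration adds such a bonus to empirical value estimates (an upper bound), while the pessimistic counterpart subtracts it (a lower bound); the true value is then bracketed with high probability.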
• Economics
AISTATS
• 2018
A stochastic approximation of the fictitious play process is built using an architecture inspired by actor-critic algorithms, and convergence of the method towards a Nash equilibrium is proved in both zero-sum two-player multistage games and cooperative multistage games.
• Economics
2016 IEEE 55th Conference on Decision and Control (CDC)
• 2016
This paper presents a class of simple suboptimal strategies that can be constructed by playing a certain repeated static game where neither player observes the specific mixed strategies used by the other player at each round, and quantifies the suboptimality of the resulting strategies.
• Computer Science
NeurIPS
• 2020
An optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new *Nash V-learning* algorithm, which matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode.
• Computer Science
ICML
• 2020
This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm, which establishes regret guarantees for both stochastic and adversarial rewards.
• Computer Science
ICML
• 2020
This paper proves that, in episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a sublinear regret guarantee.
• Computer Science
ICML
• 2021
This paper designs an algorithm for two-player zero-sum Markov games that outputs a single Markov policy with an optimality guarantee, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute.
• Computer Science
ArXiv
• 2020
This work studies online agnostic learning, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable, and presents an algorithm that achieves after $K$ episodes a sublinear $\tilde{\mathcal{O}}(K^{3/4})$ regret, which is the first sublinear regret bound (to the authors' knowledge) in the online agnostic setting.