Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

Shuang Qiu, Xiaohan Wei, Jieping Ye, Zhaoran Wang, Zhuoran Yang
While single-agent policy optimization in a fixed environment has recently attracted substantial attention in the reinforcement learning community, much less is known theoretically when multiple agents play in a potentially competitive environment. We take a step forward by proposing and analyzing new fictitious play policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures…
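For background, fictitious play has each player best-respond to the opponent's empirical action frequencies. The sketch below illustrates classical fictitious play on a zero-sum matrix game (matching pennies); it is an illustrative example of the underlying dynamics only, not the paper's Markov-game algorithm, and all names in it are ours.

```python
# Classical fictitious play on a zero-sum matrix game (illustrative sketch).

def best_response(payoffs, opp_counts):
    """Action maximizing expected payoff against the opponent's
    empirical action frequencies."""
    total = sum(opp_counts)
    freqs = [c / total for c in opp_counts]
    values = [sum(payoffs[a][b] * freqs[b] for b in range(len(freqs)))
              for a in range(len(payoffs))]
    return max(range(len(values)), key=values.__getitem__)

def fictitious_play(A, rounds=20000):
    """A[i][j] is the row player's payoff; the column player receives
    -A[i][j]. Returns both players' empirical mixed strategies."""
    n, m = len(A), len(A[0])
    row_counts, col_counts = [1] * n, [1] * m  # uniform prior counts
    # Column player's payoff matrix, indexed [col_action][row_action].
    neg_AT = [[-A[i][j] for i in range(n)] for j in range(m)]
    for _ in range(rounds):
        i = best_response(A, col_counts)       # row best-responds
        j = best_response(neg_AT, row_counts)  # column best-responds
        row_counts[i] += 1
        col_counts[j] += 1
    tr, tc = sum(row_counts), sum(col_counts)
    return [c / tr for c in row_counts], [c / tc for c in col_counts]

# Matching pennies: the unique Nash equilibrium mixes (1/2, 1/2) for both.
A = [[1, -1], [-1, 1]]
row_strategy, col_strategy = fictitious_play(A)
```

In zero-sum games the empirical frequencies of fictitious play converge to a Nash equilibrium (Robinson's theorem), so after many rounds both strategies here approach the uniform mixture.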


Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence

This work investigates the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning and proposes NPG variants for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees.

Zero-Determinant Strategy in Stochastic Stackelberg Asymmetric Security Game

This paper verifies the existence of ZD strategies for the defender and investigates the performance of the defender’s ZD strategy against a boundedly rational attacker, in comparison with the SSE strategy.

Learning Stationary Nash Equilibrium Policies in n-Player Stochastic Games with Independent Chains via Dual Mirror Descent

We consider a subclass of n-player stochastic games in which players have their own internal state/action spaces while being coupled through their payoff functions. It is assumed that players’ …

Fictitious Play in Markov Games with Single Controller

Certain important classes of strategic-form games, including zero-sum and identical-interest games, have the fictitious-play property (FPP), i.e., beliefs formed in fictitious play dynamics always …

Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

This work develops provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves and proposes an optimistic variant of the least-squares minimax value iteration algorithm.

Online Reinforcement Learning in Stochastic Games

The UCSG algorithm is proposed, which achieves sublinear regret relative to the game value when competing against an arbitrary opponent; this result improves previous ones under the same setting.

Provable Self-Play Algorithms for Competitive Reinforcement Learning

This work introduces a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and shows that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, and introduces an explore-then-exploit style algorithm, which achieves a slightly worse regret, but is guaranteed to run in polynomial time even in the worst case.

Actor-Critic Fictitious Play in Simultaneous Move Multistage Games

A stochastic approximation of the fictitious play process is built using an architecture inspired by actor-critic algorithms, and convergence of the method towards a Nash equilibrium is proved for both zero-sum two-player multistage games and cooperative multistage games.

Regret minimization algorithms for single-controller zero-sum stochastic games

This paper presents a class of simple suboptimal strategies that can be constructed by playing a certain repeated static game where neither player observes the specific mixed strategies used by the other player at each round, and quantifies the suboptimality of the resulting strategies.

Near-Optimal Reinforcement Learning with Self-Play

This work presents an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm, which matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the episode length.

Optimistic Policy Optimization with Bandit Feedback

This paper considers model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback, and proposes an optimistic trust region policy optimization (TRPO) algorithm with regret guarantees for both stochastic and adversarial rewards.

Provably Efficient Exploration in Policy Optimization

This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a sublinear regret guarantee.

A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

This paper designs an algorithm for two-player zero-sum Markov games that outputs a single Markov policy with an optimality guarantee, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute.

Provably Efficient Online Agnostic Learning in Markov Games

This work studies online agnostic learning, a problem arising in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable, and presents an algorithm that after $K$ episodes achieves a sublinear $\tilde{\mathcal{O}}(K^{3/4})$ regret, the first sublinear regret bound (to the authors' knowledge) in the online agnostic setting.