Corpus ID: 247682346

Almost Optimal Algorithms for Two-player Zero-Sum Linear Mixture Markov Games

@inproceedings{chen_linear_mixture_games,
  title={Almost Optimal Algorithms for Two-player Zero-Sum Linear Mixture Markov Games},
  author={Zixiang Chen and Dongruo Zhou and Quanquan Gu},
  booktitle={International Conference on Algorithmic Learning Theory},
}
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves in the finite-horizon setting, where the transition kernel of the underlying Markov game can be parameterized by a linear function over the current state, both players' actions, and the next state. In particular, we assume that we can control both players and aim to find the Nash equilibrium by minimizing the duality gap. We propose an algorithm Nash-UCRL based on the principle "Optimism-in-Face-of…
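The duality gap the abstract refers to measures how far a strategy pair is from a Nash equilibrium: it is the difference between each player's best-response value and the value of the current pair. A minimal sketch for a one-shot zero-sum matrix game (pure Python, the function name is ours):

```python
def duality_gap(A, x, y):
    """Duality gap of the strategy pair (x, y) in the zero-sum matrix game A.

    A[i][j] is the payoff to the row (max) player when the row player
    plays i and the column player plays j. The gap is nonnegative and
    equals zero exactly at a Nash equilibrium.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Row player's best-response value against the column strategy y.
    row_best = max(sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows))
    # Column (min) player's best-response value against the row strategy x.
    col_best = min(sum(x[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols))
    return row_best - col_best

# Matching pennies: the uniform strategies form the Nash equilibrium.
A = [[1, -1], [-1, 1]]
print(duality_gap(A, [0.5, 0.5], [0.5, 0.5]))  # → 0.0 (equilibrium)
print(duality_gap(A, [1.0, 0.0], [0.5, 0.5]))  # → 1.0 (exploitable row strategy)
```

In the Markov game setting of the paper, the same quantity is evaluated with respect to the value functions of both players' policies rather than a single payoff matrix.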

Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model

Focusing on non-stationary Markov games, a fast learning algorithm called Q-FTRL and an adaptive sampling scheme that leverage the optimism principle in online adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method) are developed.

Near-Optimal Learning of Extensive-Form Games with Imperfect Information

This paper presents the first line of algorithms that require only a near-optimal number of episodes of play to reach an ε-approximate Nash equilibrium in two-player zero-sum games, achieved via two new algorithms: Balanced Online Mirror Descent and Balanced Counterfactual Regret Minimization.

Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation

H-MARL (Hallucinated Multi-Agent Reinforcement Learning), a novel sample-efficient algorithm that can balance exploration and exploitation and improve the performance compared to non-optimistic exploration methods, is proposed.

One Policy is Enough: Parallel Exploration with a Single Policy is Minimax Optimal for Reward-Free Reinforcement Learning

This paper shows that using a single policy to guide exploration across all agents is sufficient and provably near-optimal for incorporating parallelism during the exploration phase and that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs.

Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model

Focusing on non-stationary zero-sum Markov games, a learning algorithm called Nash-Q-FTRL and an adaptive sampling scheme are developed that leverage the optimism principle in adversarial learning, with a delicate design of bonus terms that ensures certain decomposability under the FTRL dynamics.

Policy Optimization for Markov Games: Unified Framework and Faster Convergence

An algorithmic framework is presented for two-player zero-sum Markov games in the full-information setting, in which each iteration consists of a policy update step at each state using a certain matrix-game algorithm and a value update step with a certain learning rate.
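One common instantiation of the per-state "matrix game algorithm" in such frameworks is a multiplicative-weights (Hedge) update, which nudges a player's policy toward higher-payoff actions. A minimal sketch of one such step (our own function name; `payoffs` stands for the stage-game payoffs of each action under the opponent's current policy):

```python
import math

def mwu_policy_step(policy, payoffs, eta):
    """One multiplicative-weights update of a single state's policy.

    policy[a] is the current probability of action a, payoffs[a] its
    stage-game payoff against the opponent's policy, eta the step size.
    """
    weights = [p * math.exp(eta * g) for p, g in zip(policy, payoffs)]
    total = sum(weights)
    return [w / total for w in weights]  # renormalize to a distribution

# Starting from uniform, the update shifts mass toward the better action.
new_policy = mwu_policy_step([0.5, 0.5], [1.0, 0.0], eta=0.1)
```

The value update step then blends the stage-game value computed from these policies into the state's value estimate with its own learning rate.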

Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

This work develops provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves and proposes an optimistic variant of the least-squares minimax value iteration algorithm.

Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games

This paper provides a novel and unified error propagation analysis in $L_p$-norm of three well-known algorithms adapted to Stochastic Games and shows that one can achieve a stationary policy that is $(2\gamma\epsilon+\epsilon')/(1-\gamma)^{2}$-optimal.

Learning Nash Equilibrium for General-Sum Markov Games from Batch Data

A new definition of $\epsilon$-Nash equilibrium in MGs is proposed which captures the quality of a strategy in multiplayer games, and a neural network architecture named NashNetwork is introduced that successfully learns a Nash equilibrium in a generic multiplayer general-sum turn-based MG.

Nash Q-Learning for General-Sum Stochastic Games

This work extends Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games, and implements an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
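The distinctive step in Nash-Q-style methods is the bootstrap target: instead of maximizing over the agent's own actions as in single-agent Q-learning, the update bootstraps from the equilibrium value of the stage game at the next state. A minimal sketch under a simplifying assumption (we use the pure-strategy maximin value, which coincides with the Nash value only when the stage-game matrix has a saddle point; in general the Nash value requires solving a small linear program):

```python
def stage_value(Q_s):
    """Pure-strategy maximin value of the stage game Q_s[a][b].

    Valid as the Nash value when Q_s has a saddle point; a mixed
    equilibrium would otherwise have to be computed instead.
    """
    return max(min(row) for row in Q_s)

def nash_q_update(Q, s, a, b, r, s_next, alpha, gamma):
    """One Nash-Q-style update for the transition (s, a, b, r, s_next):
    bootstrap from the equilibrium value of the next state's stage game."""
    target = r + gamma * stage_value(Q[s_next])
    Q[s][a][b] = (1 - alpha) * Q[s][a][b] + alpha * target

# Toy example: state 1's stage game has a saddle point with value 1.
Q = {0: [[0.0, 0.0], [0.0, 0.0]],
     1: [[2.0, 1.0], [0.0, -1.0]]}
nash_q_update(Q, s=0, a=0, b=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# Q[0][0][0] moves halfway toward the target 1 + 0.9 * 1 = 1.9, i.e. to 0.95.
```

Balancing exploration with exploitation, as in the online version described above, amounts to sometimes deviating from the stage-game equilibrium policies when selecting actions.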

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

The sample complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors is settled by showing how to generalize a near-optimal Q-learning-based algorithm for MDPs, in particular that of Sidford et al. (2018), to two-player strategy computation.

Feature-Based Q-Learning for Two-Player Stochastic Games

This work proposes a two-player Q-learning algorithm for approximating the Nash equilibrium strategy via sampling and proves that the algorithm is guaranteed to find an $\epsilon$-optimal strategy using no more than $\tilde{\mathcal{O}}(K/(\epsilon^{2}(1-\gamma)^{4}))$ samples with high probability.

Online Reinforcement Learning in Stochastic Games

The UCSG algorithm is proposed that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent, and this result improves previous ones under the same setting.

On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games

Non-stationary reinforcement learning algorithms and their theoretical guarantees are extended to the case of discounted zero-sum Markov Games (MGs), and it is shown that their performance depends mostly on the nature of the propagation error.

Minimax Sample Complexity for Turn-based Stochastic Game

This work proves that the plug-in solver approach, arguably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs) by utilizing a "simulator" that allows sampling from any state-action pair.

Near-Optimal Reinforcement Learning with Self-Play

An optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$ is presented, along with a new \emph{Nash V-learning} algorithm whose sample complexity matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode.