Offline Learning in Markov Games with General Function Approximation

@article{Zhang2023OfflineLI,
  title={Offline Learning in Markov Games with General Function Approximation},
  author={Yuheng Zhang and Yu Bai and Nan Jiang},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.02571}
}
We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium -- such as a Nash equilibrium or a (coarse) correlated equilibrium -- from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium concept separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation…
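
As a quick reminder of the solution concepts named in the abstract, here are the standard equilibrium-gap definitions in generic notation (a sketch of the usual conventions, not necessarily the paper's own symbols): a policy pair in a two-player zero-sum Markov game is an approximate Nash equilibrium when its duality gap is small, and a correlated policy in a general-sum game is an approximate (coarse) correlated equilibrium when no player gains much from a unilateral deviation.

% Equilibrium gaps in generic notation (standard definitions, not the paper's exact ones).
\[
\mathrm{NE\text{-}gap}(\mu,\nu) \;=\; \max_{\mu'} V_1^{\mu',\nu}(s_1) \;-\; \min_{\nu'} V_1^{\mu,\nu'}(s_1)
\qquad \text{(two-player zero-sum)}
\]
\[
\mathrm{CCE\text{-}gap}(\pi) \;=\; \max_{i\in[m]} \Big( \max_{\pi_i'} V_{1,i}^{\pi_i' \times \pi_{-i}}(s_1) \;-\; V_{1,i}^{\pi}(s_1) \Big)
\qquad \text{(general-sum, } \pi \text{ possibly correlated)}
\]
% A policy (pair) is an \varepsilon-NE / \varepsilon-CCE when the corresponding gap is at most \varepsilon.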


Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game

This paper proposes a new pessimism-based algorithm for offline single-agent MDPs with linear function approximation, extends the techniques to two-player zero-sum Markov games (MGs), and establishes a new performance lower bound for MGs.

Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

This work develops provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves and proposes an optimistic variant of the least-squares minimax value iteration algorithm.

Almost Optimal Algorithms for Two-player Zero-Sum Linear Mixture Markov Games

We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves in the finite-horizon setting, where the transition kernel of the underlying Markov game can be parameterized as a linear mixture of known basis kernels.

The Complexity of Markov Equilibrium in Stochastic Games

We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the desired approximation is an absolute constant.

Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

An algorithm is proposed in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment while using a stabilized online mirror descent (OMD) subroutine for policy updates; this appears to be the first sample complexity result for learning in generic general-sum Markov games.
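
For intuition, the core policy-update step mentioned here is, in its simplest form, an exponential-weights / mirror-descent update on each state's action distribution. The sketch below shows only that basic form (the stabilized variant differs in details, and all names are illustrative assumptions):

import numpy as np

def omd_policy_update(pi_s, q_est, eta):
    """One basic entropy-regularized mirror-descent (exponential-weights) step on a
    single state's action distribution pi_s, given an estimated action-value vector q_est."""
    logits = np.log(pi_s) + eta * q_est
    logits -= logits.max()              # subtract max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

# Example: start uniform over 3 actions and shift mass toward the higher-value action.
pi = np.ones(3) / 3
pi = omd_policy_update(pi, q_est=np.array([0.2, 0.5, 0.1]), eta=1.0)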

Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

A pessimism-based algorithm is proposed, dubbed pessimistic minimax value iteration (PMVI), which overcomes distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving Nash equilibria based on the two value functions.
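
A minimal tabular sketch of the idea described here (not the authors' PMVI as stated; the count-free bonus form and all function names are illustrative assumptions): penalize the max-player's estimated stage payoffs by an uncertainty bonus, then solve the resulting zero-sum matrix game to back up a pessimistic value and extract that player's policy.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Value and maximizing-row strategy of the zero-sum matrix game with payoff A
    (row player maximizes, column player minimizes), via a small linear program."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v <= x^T A e_j for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # x is a probability vector
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

def pessimistic_backup(r_hat, P_hat, V_next, bonus):
    """One backward-induction step at a single state: penalize the estimated stage
    payoffs Q[a, b] = r_hat[a, b] + P_hat[a, b, :] @ V_next by a bonus, then solve."""
    Q_low = r_hat + P_hat @ V_next - bonus
    return solve_matrix_game(Q_low)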

Towards General Function Approximation in Zero-Sum Markov Games

In the decoupled setting where the agent controls a single player and plays against an arbitrary opponent, a new model-free algorithm is proposed and it is proved that the sample complexity can be bounded by a generalization of Witness rank to Markov games.

Offline Reinforcement Learning with Realizability and Single-policy Concentrability

This paper analyzes a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables are modeled by a density-ratio function against the offline data, and shows that the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability.
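
The primal-dual formulation referenced here can be summarized by the standard occupancy-measure linear program for a discounted MDP (generic notation, not necessarily the paper's); the density-ratio modeling then reparameterizes the dual variable against the offline data distribution d^D.

% Occupancy-measure (dual) LP of a discounted MDP, in generic notation.
\[
\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\]
% with the dual variable modeled through the ratio w(s,a) = d(s,a)/d^D(s,a),
% which is where realizability of w and single-policy concentrability enter.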

Is Pessimism Provably Efficient for Offline RL?

A pessimistic variant of the value iteration algorithm (PEVI) is developed, which incorporates an uncertainty quantifier as the penalty function, and a data-dependent upper bound on the suboptimality of PEVI is established for general Markov decision processes (MDPs).
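
A minimal tabular, finite-horizon sketch of pessimistic value iteration as described here (the count-based bonus and all names are illustrative assumptions, not the paper's exact uncertainty quantifier):

import numpy as np

def pessimistic_value_iteration(r_hat, P_hat, counts, beta=1.0):
    """Tabular finite-horizon value iteration with a count-based pessimism penalty.

    r_hat:  (H, S, A) estimated rewards in [0, 1]
    P_hat:  (H, S, A, S) estimated transition probabilities
    counts: (H, S, A) visit counts in the offline dataset
    """
    H, S, A = r_hat.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        bonus = beta / np.sqrt(np.maximum(counts[h], 1.0))   # uncertainty quantifier (illustrative)
        Q = r_hat[h] + P_hat[h] @ V[h + 1] - bonus           # penalized (pessimistic) Q-estimate
        Q = np.clip(Q, 0.0, H - h)                           # keep values in the feasible range
        pi[h] = Q.argmax(axis=1)                             # greedy policy w.r.t. pessimistic Q
        V[h] = Q.max(axis=1)
    return V, pi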

When is Offline Two-Player Zero-Sum Markov Game Solvable?

A new assumption named unilateral concentration is proposed, and a pessimism-type algorithm is designed that achieves minimax sample complexity without any modification in two widely studied settings: datasets satisfying a uniform concentration assumption, and turn-based Markov games.