Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravčík, Rudolf Kadlec, Michael H. Bowling
AAAI Conference on Artificial Intelligence
Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state… 
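
The baseline idea named in the abstract can be illustrated with a generic control variate on a single decision. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it uses one decision point with known action values, samples one action per trial, and corrects the importance-weighted estimate with a per-action baseline. The function names, the toy values, and the sampling probabilities are all hypothetical.

```python
import random


def baseline_corrected_values(values, sample_probs, baselines, rng):
    """Sample one action and return an estimate of every action's value.

    For the sampled action a: b[a] + (v[a] - b[a]) / q[a].
    For every other action:   b[a].
    The expectation over the sampled action equals v[a] for any baseline b.
    """
    actions = list(range(len(values)))
    sampled = rng.choices(actions, weights=sample_probs, k=1)[0]
    estimates = []
    for a in actions:
        estimate = baselines[a]
        if a == sampled:
            # Importance-weighted correction, applied only where we sampled.
            estimate += (values[a] - baselines[a]) / sample_probs[a]
        estimates.append(estimate)
    return estimates


def mean_and_variance(values, sample_probs, baselines, action,
                      trials=20000, seed=0):
    """Empirical mean and variance of the estimate for one action."""
    rng = random.Random(seed)
    samples = [
        baseline_corrected_values(values, sample_probs, baselines, rng)[action]
        for _ in range(trials)
    ]
    mean = sum(samples) / trials
    variance = sum((s - mean) ** 2 for s in samples) / trials
    return mean, variance


# Hypothetical toy problem: three actions with known values.
VALUES = [1.0, -2.0, 0.5]
SAMPLE_PROBS = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    for name, baselines in [("zero baseline", [0.0, 0.0, 0.0]),
                            ("exact baseline", VALUES)]:
        mean, variance = mean_and_variance(VALUES, SAMPLE_PROBS, baselines, 2)
        print(f"{name}: mean={mean:.3f} variance={variance:.3f}")
```

In this toy setting both baselines give estimates centred on the true value, but a baseline close to the true value removes most of the spread, and an exact baseline removes all of it. How the baseline is learned and propagated through a game tree is outside the scope of this sketch.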

Low-Variance and Zero-Variance Baselines for Extensive-Form Games

A framework of baseline-corrected values in EFGs that generalizes the previous work is introduced, and one particular choice of baseline function, the predictive baseline, is shown to be provably optimal under certain sampling schemes.

A Fast-Convergence Method of Monte Carlo Counterfactual Regret Minimization for Imperfect Information Dynamic Games

Semi-OS, a fast-convergence method developed from Outcome-Sampling MCCFR (OS), the most popular variant of MCCFR, is introduced, and it is shown that, with an appropriate discount rate, Semi-OS not only significantly speeds up convergence in Leduc poker, a common testbed for imperfect information games, but also statistically outperforms OS in head-to-head matches.

Double Neural Counterfactual Regret Minimization

This paper proposes a double neural representation for imperfect information games, where one neural network represents the cumulative regret and the other represents the average strategy, and adopts the counterfactual regret minimization algorithm to optimize this double neural representation.

Stochastic Regret Minimization in Extensive-Form Games

A new framework for developing stochastic regret minimization methods that allows the use of any regret-minimization algorithm, coupled with any gradient estimator, and allows several new stochastic methods for solving sequential games to be instantiated.


  • Computer Science
  • 2019
This paper proposes a double neural representation for the IIGs, where one neural network represents the cumulative regret, and the other represents the average strategy, and achieves strong performance while using hundreds of times less memory than the tabular CFR.

The Advantage Regret-Matching Actor-Critic

A model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC), is introduced, which learns from sampled trajectories in a centralized training setting without requiring the importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments.

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

This paper proposes an unbiased model-free method, ESCHER, that is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability and is able to beat DREAM and NFSP in a head-to-head competition over 90% of the time.

RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning


  • Computer Science
  • 2019
Single Deep CFR is introduced, a variant of Deep CFR that has a lower overall approximation error by avoiding the training of an average strategy network and is more attractive from a theoretical perspective and empirically outperforms Deep CFR with respect to exploitability and one-on-one play in poker.

Efficient CFR for Imperfect Information Games with Instant Updates

This work proposes a novel counterfactual regret minimization method with instant updates, which has a provably lower convergence bound and a provably tighter space complexity bound and converges three times faster than the hybrid method used in DeepStack.



Generalized Sampling and Variance in Counterfactual Regret Minimization

This paper generalizes MCCFR by considering any generic estimator of the sought values, and shows that any choice of an estimator can be used to probabilistically minimize regret, provided the estimator is bounded and unbiased.

Monte Carlo Sampling for Regret Minimization in Extensive Games

A general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) is described, of which the original and poker-specific versions are special cases.

AIVAT: A New Variance Reduction Technique for Agent Evaluation in Imperfect Information Games

AIVAT is introduced, a low variance, provably unbiased value assessment tool that uses an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents, to reduce the variance both from choices by nature and by players with a known strategy.

Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games

This thesis investigates the problem of decision-making in large two-player zero-sum games using Monte Carlo sampling and regret minimization methods and develops a theory for applying counterfactual regret minimization to a generic subset of imperfect recall games.

Variance Reduction in Monte-Carlo Tree Search

This paper examines the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates, and demonstrates their efficacy on three different stochastic, single-agent settings: Pig, Can't Stop and Dominion.

Quality-based Rewards for Monte-Carlo Tree Search Simulations

This paper introduces new measures for assessing the a posteriori quality of a simulation and shows that altering the rewards of play-outs based on their assessed quality improves results in six distinct two-player games and in the General Game Playing agent CADIAPLAYER.

Solving Games with Functional Regret Estimation

A novel online learning method for minimizing regret in large extensive-form games that learns a function approximator online to estimate the regret for choosing a particular action and proves the approach sound by providing a bound relating the quality of the function approximation and regret of the algorithm.

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

This paper examines the role of policy gradient and actor-critic algorithms in partially-observable multiagent environments and relates them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees.

The Mirage of Action-Dependent Baselines in Reinforcement Learning

The variance of the policy gradient estimator is decomposed, and it is shown numerically that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains.