Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

Anton Bakhtin, David J. Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H. Miller, Noam Brown

A planning algorithm called DiL-piKL is introduced that regularizes a reward-maximizing policy toward a human imitation-learned policy, and it is proven to be a no-regret learning algorithm under a modified utility function. DiL-piKL is then extended into a self-play reinforcement learning algorithm, RL-DiL-piKL, that provides a model of human play while simultaneously training an agent that responds well to this human model. RL-DiL-piKL was used to train an agent named Diplodocus…
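The regularization described above has a well-known closed form: maximizing expected reward minus a λ-weighted KL penalty toward an anchor policy yields a softmax that interpolates between the anchor and the greedy policy. The sketch below illustrates only that closed form; the function name and the toy action set are illustrative assumptions, not the paper's implementation.

```python
import math

def kl_regularized_policy(q_values, anchor_policy, lam):
    """Closed-form maximizer of E_pi[Q] - lam * KL(pi || anchor):
    pi(a) is proportional to anchor(a) * exp(Q(a) / lam).

    A large lam pulls the policy toward the human-imitation anchor;
    a small lam pulls it toward greedy reward maximization.
    """
    weights = {a: anchor_policy[a] * math.exp(q_values[a] / lam)
               for a in q_values}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}
```

For example, with Q-values {"hold": 0.0, "attack": 1.0} and an anchor of {"hold": 0.9, "attack": 0.1}, a small λ produces a near-greedy policy on "attack", while a large λ reproduces the anchor almost exactly.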

Human-level play in the game of Diplomacy by combining language models with strategic reasoning

Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players, is introduced.

Towards automating Codenames spymasters with deep reinforcement learning

Codenames is a good benchmark for both human-AI cooperation and text-based reinforcement learning, both of which are important areas of AI research.

Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers

Perfect illusory attacks are introduced: a novel form of adversarial attack on sequential decision-makers that is both effective and provably statistically undetectable. Also introduced are the more versatile ε-illusory attacks, which produce observation transitions that are consistent with the environment's state-transition function and can be learned end-to-end.

No Press Diplomacy: Modeling Multi-Agent Gameplay

This work focuses on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players, and presents DipNet, a neural-network-based policy model for No Press Diplomacy.

Modeling Strong and Human-Like Gameplay with KL-Regularized Search

A novel regret minimization algorithm is introduced that is regularized based on the KL divergence from an imitation-learned policy, and it is shown that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.

No-Press Diplomacy from Scratch

An algorithm for action exploration and equilibrium approximation in games with combinatorial action spaces is presented, along with evidence that the resulting agent plays a strategy incompatible with human-data-bootstrapped agents, suggesting that self-play alone may be insufficient for achieving superhuman performance in Diplomacy.

Human-Level Performance in No-Press Diplomacy via Equilibrium Search

An agent for the no-press variant of Diplomacy is described that combines supervised learning on human data with one-step lookahead search via external regret minimization; it achieved a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website.
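The external regret minimization used in such lookahead search can be illustrated with regret matching on a small matrix game. The code below is a generic sketch on a toy symmetric zero-sum game, assuming nothing about the paper's actual search procedure; names and the example game are illustrative.

```python
def solve_symmetric_zero_sum(payoff, iters=100_000):
    """Approximate equilibrium play via regret matching, a standard
    external regret minimization algorithm. `payoff[a][b]` is the row
    player's utility for action a against action b; the game is assumed
    symmetric and zero-sum, so both players can share one strategy.

    Returns the time-averaged strategy, which converges toward a Nash
    equilibrium in zero-sum games.
    """
    n = len(payoff)
    cum_regret = [0.0] * n
    avg = [0.0] * n
    strat = [1.0 / n] * n
    for _ in range(iters):
        # Expected utility of each pure action against the current strategy.
        u = [sum(payoff[a][b] * strat[b] for b in range(n)) for a in range(n)]
        ev = sum(strat[a] * u[a] for a in range(n))
        # Accumulate regret for not having committed to each action.
        for a in range(n):
            cum_regret[a] += u[a] - ev
        # Next strategy is proportional to positive cumulative regret.
        pos = [max(r, 0.0) for r in cum_regret]
        norm = sum(pos)
        strat = [p / norm for p in pos] if norm > 0 else [1.0 / n] * n
        for a in range(n):
            avg[a] += strat[a]
    return [x / iters for x in avg]
```

For instance, on a rock-paper-scissors variant where rock beats scissors for a double payoff, the unique equilibrium is (1/4, 1/2, 1/4), and the averaged strategy approaches it while becoming nearly unexploitable.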

Mastering the game of Go without human knowledge

An algorithm based solely on reinforcement learning is introduced, without human data, guidance or domain knowledge beyond game rules, that achieves superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

Learning Existing Social Conventions via Observationally Augmented Self-Play

It is observed that augmenting MARL with a small amount of imitation learning greatly increases the probability that the strategy found by MARL fits well with the existing social convention, even in an environment where standard training methods very rarely find the true convention of the agent's partners.

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

This paper generalizes the AlphaZero approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games, and convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.

Grandmaster level in StarCraft II using multi-agent reinforcement learning

The agent, AlphaStar, is evaluated, which uses a multi-agent reinforcement learning algorithm and has reached Grandmaster level, ranking among the top 0.2% of human players for the real-time strategy game StarCraft II.

"Other-Play" for Zero-Shot Coordination

This work introduces a novel learning algorithm called other-play (OP) that enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem.

Evaluation of Human-AI Teams for Learned and Rule-Based Agents in Hanabi

A single-blind evaluation of teams of humans and AI agents in the cooperative card game Hanabi finds that humans have a clear preference for a rule-based AI teammate (SmartBot) over a state-of-the-art learning-based AI teammate (Other-Play) across nearly all subjective metrics, and generally view the learning-based agent negatively, despite no statistical difference in game score.