Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

@article{Schrittwieser2020MasteringAG,
  title={Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model},
  author={Julian Schrittwieser and Ioannis Antonoglou and Thomas Hubert and Karen Simonyan and L. Sifre and Simon Schmitt and Arthur Guez and Edward Lockhart and Demis Hassabis and Thore Graepel and Timothy P. Lillicrap and David Silver},
  journal={Nature},
  year={2020},
  volume={588},
  number={7839},
  pages={604--609}
}
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess [1] and Go [2], where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman…
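The core idea in the abstract — planning with a learned model instead of a perfect simulator — can be sketched in miniature. This is an illustrative skeleton only, not the paper's implementation: the function names (`representation`, `dynamics`, `prediction`) mirror the three learned components MuZero is known to use, but the placeholder bodies and the simple rollout planner are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    hidden_state: tuple
    reward: float

def representation(observation):
    # h: encode the raw observation into an abstract hidden state.
    return ("s0", observation)

def dynamics(hidden_state, action):
    # g: predict the next hidden state and immediate reward for an action,
    # without ever querying the real environment.
    return ModelOutput(hidden_state=(hidden_state, action), reward=0.0)

def prediction(hidden_state):
    # f: a policy prior over actions and a value estimate, used to guide
    # and truncate the tree search.
    actions = [0, 1]
    policy = {a: 1.0 / len(actions) for a in actions}
    value = 0.0
    return policy, value

def rollout(observation, action_sequence):
    # Plan entirely inside the learned model: unroll the dynamics along a
    # candidate action sequence, accumulating predicted reward, then
    # bootstrap with the value estimate at the leaf.
    state = representation(observation)
    total_reward = 0.0
    for a in action_sequence:
        out = dynamics(state, a)
        state = out.hidden_state
        total_reward += out.reward
    _, leaf_value = prediction(state)
    return total_reward + leaf_value

print(rollout("obs", [0, 1, 0]))  # 0.0 with these placeholder networks
```

In the actual algorithm these three functions are neural networks trained end-to-end, and the rollout above is replaced by Monte Carlo tree search over the learned hidden states.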

Learning to Play: Reinforcement Learning and Games

TLDR
It is shown that the methods generalize to three games, hinting at artificial general intelligence; an argument can be made that the agents thereby fail the Turing test, since no human can play at this level.

Continuous Control for Searching and Planning with a Learned Model

TLDR
This paper provides a way and the necessary theoretical results to extend the MuZero algorithm to more generalized environments with continuous action space and shows the proposed algorithm outperforms the soft actor-critic (SAC) algorithm, a state-of-the-art model-free deep reinforcement learning algorithm.

PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals

TLDR
This work proposes PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards, and indicates that it can achieve comparable performance whilst being around 4-8 times more sample efficient.

Exploiting Bias for Cooperative Planning in Multi-Agent Tree Search

TLDR
A novel method of tree search based on a mixture of the individual and joint action space, which can be interpreted as a cascading effect where agents are biased by exploration of new actions, exploitation of previously profitable ones, and recommendation provided by deep neural nets is introduced.

Hierarchical Width-Based Planning and Learning

TLDR
This paper presents a hierarchical algorithm that plans at two levels of abstraction, and shows how, in combination with a learned policy and a learned value function, the proposed hierarchical IW can outperform current IW-based planners in Atari games with sparse rewards.

Guided Exploration with Proximal Policy Optimization using a Single Demonstration

TLDR
This article uses a single human demonstration to solve hard-exploration problems of comparable difficulty in a three-dimensional environment, adapting this idea and integrating it with proximal policy optimization (PPO).

On the role of planning in model-based deep reinforcement learning

TLDR
This paper studies the performance of MuZero, a state-of-the-art model-based reinforcement learning algorithm with strong connections and overlapping components with many other MBRL algorithms, and suggests that planning alone is insufficient to drive strong generalization.

Bootstrapped model learning and error correction for planning with uncertainty in model-based RL

TLDR
This paper proposes a bootstrapped multi-headed neural network that learns the distribution of future states and rewards and introduces a global error correction filter that applies high-level constraints guided by the context provided through the predictive distribution.

Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision

Using a model of the environment, reinforcement learning agents can plan their future moves and achieve superhuman performance in board games like Chess, Shogi, and Go, while remaining relatively
...

References

SHOWING 1-10 OF 56 REFERENCES

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

TLDR
This paper generalizes the AlphaZero approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games, and convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.

Grandmaster level in StarCraft II using multi-agent reinforcement learning

TLDR
The agent, AlphaStar, is evaluated, which uses a multi-agent reinforcement learning algorithm and has reached Grandmaster level, ranking among the top 0.2% of human players for the real-time strategy game StarCraft II.

Mastering the game of Go without human knowledge

TLDR
An algorithm based solely on reinforcement learning is introduced, without human data, guidance or domain knowledge beyond game rules, that achieves superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract)

TLDR
The promise of ALE is illustrated by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning, and an evaluation methodology made possible by ALE is proposed.

Observe and Look Further: Achieving Consistent Performance on Atari

TLDR
This paper proposes an algorithm that addresses three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently.

TreeQN and ATreeC: Differentiable Tree Planning for Deep Reinforcement Learning

TLDR
TreeQN is proposed, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions and ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network.

Mastering the game of Go with deep neural networks and tree search

TLDR
Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.

Model-Based Reinforcement Learning for Atari

TLDR
Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, is described and a comparison of several model architectures is presented, including a novel architecture that yields the best results in the authors' setting.

Learning Latent Dynamics for Planning from Pixels

TLDR
The Deep Planning Network (PlaNet) is proposed, a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space using a latent dynamics model with both deterministic and stochastic transition components.
...