• Publications
Counterfactual Multi-Agent Policy Gradients
TLDR
We propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients.
  • 496
  • 95
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
TLDR
We propose QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion.
  • 256
  • 69
Learning to Communicate with Deep Multi-Agent Reinforcement Learning
TLDR
We introduce new environments for studying the learning of communication protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability.
  • 590
  • 63
Learning with Opponent-Learning Awareness
TLDR
We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment.
  • 206
  • 35
A theoretical and empirical analysis of Expected Sarsa
TLDR
Expected Sarsa exploits knowledge about stochasticity in the behavior policy to perform updates with lower variance.
  • 131
  • 28
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning
TLDR
This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data, and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory.
  • 259
  • 25
The StarCraft Multi-Agent Challenge
TLDR
We propose the StarCraft Multi-Agent Challenge (SMAC) as a benchmark problem to fill this gap.
  • 75
  • 24
A Survey of Multi-Objective Sequential Decision-Making
TLDR
Sequential decision-making problems with multiple objectives arise naturally in practice and pose unique challenges for research in decision-theoretic planning.
  • 291
  • 22
LipNet: End-to-End Sentence-level Lipreading
TLDR
We present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
  • 157
  • 20
Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
TLDR
We propose and analyze a new algorithm, called Relative Upper Confidence Bound (RUCB), for the K-armed dueling bandit problem (Yue et al., 2012), where the feedback comes in the form of pairwise preferences.
  • 73
  • 18