Corpus ID: 237213211

Settling the Variance of Multi-Agent Policy Gradients

Authors: Jakub Grudzien Kuba, Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Henry Mguni, Jun Wang
Policy gradient (PG) methods are popular reinforcement learning (RL) methods in which a baseline is often subtracted to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem extends naturally, the effectiveness of multi-agent PG (MAPG) methods degrades because the variance of gradient estimates grows rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of…
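The variance-reduction role of a baseline described in the abstract can be illustrated with a minimal sketch. This is not code from the paper: it assumes a hypothetical one-step, two-action bandit with a softmax policy and compares the empirical variance of the score-function gradient estimate with and without a constant baseline. Both estimators target the same gradient; only their variance differs.

```python
import math
import random

random.seed(0)

# Hypothetical two-action bandit: softmax policy over logits theta,
# noisy rewards with per-action means REWARD_MEAN.
theta = [0.2, -0.1]
REWARD_MEAN = [1.0, 0.5]

def sample_grad(baseline):
    """One-sample score-function estimate of d E[r] / d theta[0],
    with the reward shifted by a constant baseline."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    p0 = exps[0] / z
    a = 0 if random.random() < p0 else 1
    r = REWARD_MEAN[a] + random.gauss(0.0, 0.1)        # noisy reward
    score = (1 - p0) if a == 0 else (-p0)              # d log pi(a) / d theta[0]
    return score * (r - baseline)

def mean_var(samples):
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, v

N = 20000
no_base = [sample_grad(0.0) for _ in range(N)]
with_base = [sample_grad(0.75) for _ in range(N)]  # baseline near the average reward

m0, v0 = mean_var(no_base)
m1, v1 = mean_var(with_base)
print(f"no baseline:   mean={m0:+.3f} var={v0:.4f}")
print(f"with baseline: mean={m1:+.3f} var={v1:.4f}")
```

Because the baseline is constant, the expected value of the extra term is zero (the score function has zero mean), so both sample means agree while the baseline variant shows a much smaller variance. The paper's point is that in the multi-agent setting this variance grows quickly with the number of agents, making the choice of baseline even more consequential.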


Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art in MARL.
Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning
This paper introduces the set of cooperative games in which value decomposition methods are valid, referred to as decomposable games, and theoretically proves that applying the multi-agent fitted Q-Iteration algorithm (MA-FQI) leads to an optimal Q-function.
Multi-Agent Constrained Policy Optimisation
Two algorithms are proposed, Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian, which leverage the theories from both constrained policy optimisation and multi-agent trust region learning and enjoy theoretical guarantees of both monotonic improvement in reward and satisfaction of safety constraints at every iteration.
Multi-Agent Reinforcement Learning is a Sequence Modeling Problem
A novel architecture named Multi-Agent Transformer (MAT) is introduced that effectively casts cooperative multi-agent reinforcement learning (MARL) into sequence modeling (SM) problems, wherein the task is to map the agents’ observation sequence to the agents’ optimal action sequence, and endows MAT with a monotonic performance improvement guarantee.
LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning
A new general framework for improving the coordination and performance of multi-agent reinforcement learners, named the Learnable Intrinsic-Reward Generation Selection algorithm (LIGS), which introduces an adaptive learner, the Generator, that observes the agents and learns to construct intrinsic rewards online that coordinate the agents’ joint exploration and joint behaviour.
Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks
This work introduces the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and proposes the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning.
A Review of Safe Reinforcement Learning: Methods, Theory and Applications
A review of the progress of safe RL from the perspectives of methods, theory and applications, covering the problems crucial for deploying safe RL in real-world applications, coined as “2H3W”.
Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning
The results show that the proposed decentralized approach to learning and (re)executing control actions for a generic multi-DoF manipulator accelerates the early stages of learning relative to the single-agent framework while also reducing computational effort.


Off-Policy Multi-Agent Decomposed Policy Gradients
This paper investigates causes that hinder the performance of MAPG algorithms and presents a multi-agent decomposed policy gradient method (DOP), which introduces the idea of value function decomposition into the multi-agent actor-critic framework and formally shows that DOP critics have sufficient representational capability to guarantee convergence.
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
This paper considers variance reduction methods that were developed for Monte Carlo estimates of integrals, and gives bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system.
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games
This work shows that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures.
Infinite-Horizon Policy-Gradient Estimation
GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
Expected Policy Gradients for Reinforcement Learning
It is proved that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead, and a new general policy gradient theorem is established, of which the stochastic and deterministic policy gradient theorems are special cases.
The Mirage of Action-Dependent Baselines in Reinforcement Learning
The variance of the policy gradient estimator is decomposed, and it is shown numerically that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains.
The Optimal Reward Baseline for Gradient-Based Reinforcement Learning
This work incorporates a reward baseline into the learning system, shows that it affects variance without introducing bias, and finds that the optimal constant reward baseline equals the long-term average expected reward.
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent.