Corpus ID: 245335360

Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks

Linghui Meng, Muning Wen, Yaodong Yang, Chenyang Le, Xiyun Li, Weinan Zhang, Ying Wen, Haifeng Zhang, Jun Wang, Bo Xu
Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without any need to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we… 

Offline Multi-Agent Reinforcement Learning with Knowledge Distillation

This work introduces an offline multi-agent reinforcement learning (offline MARL) framework that utilizes previously collected data without additional online data collection and reformulates offline MARL as a sequence modeling problem and thus builds on top of the simplicity and scalability of the Transformer architecture.
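The sequence-modeling reformulation mentioned here typically follows the Decision Transformer convention: each trajectory is flattened into an interleaved stream of returns-to-go, states, and actions, and the model predicts actions autoregressively. A minimal sketch of that token-stream construction (function names and the tuple encoding are illustrative, not from the paper):

```python
def returns_to_go(rewards):
    """Suffix sums: the return-to-go at step t is the sum of rewards from t onward."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples, Decision-Transformer style."""
    seq = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        seq += [("rtg", g), ("state", s), ("action", a)]
    return seq
```

Conditioning each action prediction on the preceding return-to-go token is what lets the model be prompted for high-return behavior at evaluation time.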

Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

Planning to Practice is proposed, a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve and can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.

Pretraining in Deep Reinforcement Learning: A Survey

This survey seeks to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.

Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision

One of the main observations made in this work is that, with a suitable representation learning and domain generalization approach, it can be significantly easier for the reward function to generalize to a new but structurally similar task than for the policy. This means that a learned reward function can be used to facilitate the fine-tuning of the robot’s policy in situations where the policy fails to generalize zero-shot but the reward function generalizes successfully.

Mildly Conservative Q-Learning for Offline Reinforcement Learning

This paper proposes Mildly Conservative Q-learning (MCQ), in which OOD actions are actively trained by assigning them proper pseudo Q values, and theoretically shows that MCQ induces a policy that behaves at least as well as the behavior policy, with no erroneous overestimation for OOD actions.

Evaluating Decision Transformer Architecture on Robot Learning Tasks

An extension of DT is proposed, called Decision LSTM (DLSTM), an architecture that replaces the Transformer model inside DT with a Long Short-Term Memory (LSTM) network; DLSTM outperforms both BC and DT and achieves expert-level performance on the stabilization tasks.

GCS: Graph-based Coordination Strategy for Multi-Agent Reinforcement Learning

This work proposes factorizing the joint team policy into a graph generator and graph-based coordinated policy to enable coordinated behaviours among agents and demonstrates the superiority of the proposed method on Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football.

Transformer-based Working Memory for Multiagent Reinforcement Learning with Action Parsing

Inspired by the human working-memory process, in which a limited amount of information temporarily held in mind can effectively guide decision-making, ATM updates its fixed-capacity memory with a working-memory updating schema.



How Crucial is Transformer in Decision Transformer?

The results suggest that the strength of the Decision Transformer for continuous control tasks may lie in the overall sequential modeling architecture and not in the Transformer per se.



D4RL: Datasets for Deep Data-Driven Reinforcement Learning

This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.

UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

This paper makes the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit tasks with different observation and action configuration requirements; it uses a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation.

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

This paper proposes a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by only trusting the state-action pairs given in the dataset for value estimation, and extends ICQ to multi-agent tasks by decomposing the joint policy under the implicit constraint.

Soft Actor-Critic Algorithms and Applications

Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.
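For reference, the maximum-entropy objective that SAC optimizes augments the expected return with a policy-entropy bonus (standard formulation; the temperature α trades off reward against entropy):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

The entropy term encourages exploration and makes the learned policy robust to near-optimal alternatives, which underlies the sample-efficiency gains cited above.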

Accelerating Online Reinforcement Learning with Offline Datasets

A novel algorithm is proposed that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of reinforcement learning policies.
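The maximum-likelihood policy update described here is, in the standard advantage-weighted formulation, a supervised regression onto dataset actions weighted by exponentiated advantages (a sketch using common notation, not taken verbatim from the paper; λ is a temperature):

```latex
\theta_{k+1} = \arg\max_\theta \;
  \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \left[ \log \pi_\theta(a \mid s) \,
         \exp\!\left( \tfrac{1}{\lambda} A^{\pi_k}(s, a) \right) \right]
```

Because the update only reweights log-likelihoods of actions already in the dataset, it avoids querying out-of-distribution actions, which is what makes offline pre-training followed by fast online fine-tuning practical.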

QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning

A new factorization method for MARL, QTRAN, is proposed, which is free from such structural constraints and takes on a new approach to transforming the original joint action-value function into an easily factorizable one, with the same optimal actions.

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
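In its commonly cited form, the conservative lower bound comes from penalizing Q-values of actions drawn from a chosen distribution μ while pushing up the values of dataset actions, alongside the usual Bellman error (a sketch of the standard CQL objective; α controls the degree of conservatism):

```latex
\min_Q \; \alpha \left(
  \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}[Q(s,a)]
  - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q(s,a)]
\right)
+ \tfrac{1}{2}\,
  \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
  \left[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s,a) \big)^2 \right]
```

The first term keeps Q-values on unseen actions from being overestimated; the second anchors the Q-function to the data via the empirical Bellman backup.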

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data without additional online data collection.

MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning

MALib is a scalable and efficient computing framework for population-based multi-agent reinforcement learning that enables efficient code reuse and flexible deployments on different distributed computing paradigms and achieves throughput higher than 40K FPS on a single machine with 32 CPU cores.