Corpus ID: 235313679

Reinforcement Learning as One Big Sequence Modeling Problem

@inproceedings{Janner2021ReinforcementLA,
  title={Reinforcement Learning as One Big Sequence Modeling Problem},
  author={Michael Janner and Qiyang Li and Sergey Levine},
  booktitle={NeurIPS},
  year={2021}
}
Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language… 
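Concretely, this view amounts to treating an entire trajectory as one long token sequence. The sketch below (Python with NumPy) is a minimal, hedged illustration rather than the authors' released code: it assumes a simple uniform discretization, and the dimensions, value ranges, and bin count are illustrative. It discretizes each state, action, and reward dimension independently and interleaves them into the kind of flat sequence an autoregressive Transformer could be trained on with a standard next-token objective; planning then reduces to decoding high-return continuations from the trained model.

import numpy as np

# Minimal sketch of the trajectory-as-token-sequence idea (illustrative only).

def discretize(x, low, high, n_bins):
    # Map continuous values in [low, high] to integer tokens in {0, ..., n_bins - 1}.
    x = np.clip(x, low, high)
    return ((x - low) / (high - low) * (n_bins - 1)).astype(np.int64)

def trajectory_to_tokens(states, actions, rewards, n_bins=100):
    # Interleave discretized states, actions, and rewards per timestep:
    # (s_0, a_0, r_0, s_1, a_1, r_1, ...) flattened into one token sequence,
    # which an autoregressive model can consume like text.
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s, -1.0, 1.0, n_bins))
        tokens.extend(discretize(a, -1.0, 1.0, n_bins))
        tokens.append(int(discretize(np.array([r]), 0.0, 1.0, n_bins)[0]))
    return np.array(tokens)

# Example: a 3-step trajectory with 4-dimensional states and 2-dimensional
# actions yields 3 * (4 + 2 + 1) = 21 tokens.
states = np.random.uniform(-1.0, 1.0, size=(3, 4))
actions = np.random.uniform(-1.0, 1.0, size=(3, 2))
rewards = np.random.uniform(0.0, 1.0, size=3)
print(trajectory_to_tokens(states, actions, rewards).shape)  # (21,)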

Citations

Decision Transformer: Reinforcement Learning via Sequence Modeling

TLDR
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

StARformer: Transformer with State-Action-Reward Representations

TLDR
This work proposes the State-Action-Reward Transformer (StARformer), which explicitly models local causal relations to improve action prediction over long sequences in reinforcement learning.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

TLDR
An approach that uses a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization, achieving state-of-the-art performance on the majority of the D4RL benchmark tasks for offline RL.

Multi-Agent Reinforcement Learning is a Sequence Modeling Problem

TLDR
A novel architecture named Multi-Agent Transformer (MAT) is introduced that casts cooperative multi-agent reinforcement learning (MARL) as a sequence modeling problem, in which the task is to map the agents' observation sequence to their optimal action sequence; this construction endows MAT with a monotonic performance improvement guarantee.

Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

TLDR
This work proposes a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows searching at test time for policies that are robust to multiple possible futures in the environment.

Can Wikipedia Help Offline Reinforcement Learning?

TLDR
This work takes advantage of this formulation of reinforcement learning as sequence modeling, investigates the transferability of sequence models pre-trained on other domains when finetuned on offline RL tasks (control, games), and proposes techniques to improve transfer between these domains.

Online Decision Transformer

TLDR
The Online Decision Transformer (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework, is proposed and shown to be competitive with the state of the art in absolute performance on the D4RL benchmark.

Transformers are Sample Efficient World Models

TLDR
IRIS is a data-efficient agent that learns inside a world model composed of a discrete autoencoder and an autoregressive Transformer; it sets a new state of the art among methods without lookahead search and even surpasses MuZero.

StARformer: Transformer with State-Action-Reward Representations for Robot Learning

TLDR
StARformer, a Transformer architecture for robot learning with image inputs, is proposed; it explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling.

Deep Transformer Q-Networks for Partially Observable Reinforcement Learning

TLDR
This work proposes Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history, and demonstrates that the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
...

References

Showing 1-10 of 75 references

Decision Transformer: Reinforcement Learning via Sequence Modeling

TLDR
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

TLDR
This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.

Model-Ensemble Trust-Region Policy Optimization

TLDR
This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.

Language as an Abstraction for Hierarchical Deep Reinforcement Learning

TLDR
This paper introduces an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine, and finds that, using the approach, agents can learn to solve diverse, temporally extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations.

MOReL: Model-Based Offline Reinforcement Learning

TLDR
Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.

Offline Reinforcement Learning with Implicit Q-Learning

TLDR
This work proposes implicit Q-learning (IQL), a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization.

On the model-based stochastic value gradient for continuous reinforcement learning

TLDR
The proposed method surpasses the asymptotic performance of other model-based approaches on the proprioceptive MuJoCo locomotion tasks from the OpenAI Gym, including a humanoid, and achieves these results with a simple deterministic world model without requiring an ensemble.

Learning to Reach Goals via Iterated Supervised Learning

TLDR
This paper proposes a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch; it formally shows that this iterated supervised learning procedure optimizes a bound on the RL objective, derives performance bounds for the learned policy, and empirically demonstrates improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks.

Conservative Q-Learning for Offline Reinforcement Learning

TLDR
Conservative Q-learning (CQL) is proposed, which aims to address the limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
...