# RvS: What is Essential for Offline RL via Supervised Learning?

@article{Emmons2021RvSWI, title={RvS: What is Essential for Offline RL via Supervised Learning?}, author={Scott Emmons and Benjamin Eysenbach and Ilya Kostrikov and Sergey Levine}, journal={ArXiv}, year={2021}, volume={abs/2112.10751} }

Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex…
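As a rough illustration of the supervised-learning recipe the abstract refers to, the sketch below relabels an offline trajectory so that each action becomes a supervised target for its state paired with an outcome. The function names `returns_to_go` and `relabel_trajectory` are hypothetical, and using the undiscounted return-to-go as the conditioning variable is an assumption made for illustration, not necessarily the paper's exact formulation.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of rewards from each timestep to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry accumulates all later rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out


def relabel_trajectory(
    states: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
) -> List[Tuple[Tuple[Any, float], Any]]:
    """Build ((state, outcome), action) pairs for outcome-conditioned behavior cloning.

    Assumption: the outcome is the return-to-go observed in the logged data.
    """
    if not (len(states) == len(actions) == len(rewards)):
        raise ValueError("states, actions and rewards must have equal length")
    outcomes = returns_to_go(rewards)
    return [((s, g), a) for s, g, a in zip(states, outcomes, actions)]


if __name__ == "__main__":
    pairs = relabel_trajectory(["s0", "s1", "s2"], ["a0", "a1", "a2"], [1.0, 0.0, 2.0])
    for (state, outcome), action in pairs:
        print(state, outcome, action)
```

A policy network, such as the two-layer feedforward MLP the abstract mentions, would then be fit by maximum likelihood on these pairs; at test time the outcome input is set to a desired target value.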

## 37 Citations

### When does return-conditioned supervised learning work for offline reinforcement learning?

- Computer Science, ArXiv
- 2022

It is shown that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms.

### Online Decision Transformer

- Computer Science, ICML
- 2022

This work proposes Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. ODT is competitive with the state of the art in absolute performance on the D4RL benchmark and shows much more significant gains during the finetuning procedure.

### You Can't Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments

- Computer Science
- 2022

The proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. This allows ESPER to achieve strong alignment between target return and expected performance in real environments.

### Dichotomy of Control: Separating What You Can Control from What You Cannot

- Computer Science, ArXiv
- 2022

This work proposes the dichotomy of control, a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity). It does so by conditioning the policy on a latent variable representation of the future and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment.

### ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement Learning

- Computer Science, ArXiv
- 2022

ConserWeightive Behavioral Cloning (CWBC) is proposed, a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization.

### You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments

- Computer Science, ArXiv
- 2022

The proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. This allows ESPER to achieve strong alignment between target return and expected performance in real environments.

### Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories

- Computer Science, ArXiv
- 2022

A simple meta-algorithmic pipeline is developed that learns an inverse-dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories.

### Return Augmentation Gives Supervised RL Temporal Compositionality

- Computer Science
- 2022

SuperB, a dynamic programming algorithm for data augmentation that augments the returns in the offline dataset by combining rewards from intersecting trajectories, is introduced. It is shown theoretically that SuperB can improve sample complexity and enable RvS to find optimal policies in cases where it previously fell behind the performance of value-based methods.

### Contrastive Learning as Goal-Conditioned Reinforcement Learning

- Computer Science, ArXiv
- 2022

This paper builds upon prior work and applies contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function.

### Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

- Computer Science, ArXiv
- 2022

Q-learning Decision Transformer (QDT) is proposed, which addresses the shortcomings of DT by leveraging the benefits of dynamic programming (Q-learning); the two approaches compensate for each other's shortcomings to achieve better performance.

## References


### D4RL: Datasets for Deep Data-Driven Reinforcement Learning

- Computer Science, ArXiv
- 2020

This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.

### A Minimalist Approach to Offline Reinforcement Learning

- Computer Science, NeurIPS
- 2021

It is shown that the performance of state-of-the-art RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data, and the resulting algorithm is a baseline that is simple to implement and tune.

### Conservative Q-Learning for Offline Reinforcement Learning

- Computer Science, NeurIPS
- 2020

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.

### MOReL : Model-Based Offline Reinforcement Learning

- Computer Science, NeurIPS
- 2020

Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.

### Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

- Computer Science, ICML
- 2021

Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly, is proposed, and it is observed that UWAC substantially improves model stability during training.

### Overcoming Model Bias for Robust Offline Deep Reinforcement Learning

- Computer Science, Eng. Appl. Artif. Intell.
- 2021

### MOPO: Model-based Offline Policy Optimization

- Computer Science, NeurIPS
- 2020

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks.

### Hyperparameter Selection for Offline Reinforcement Learning

- Computer Science, ArXiv
- 2020

This work focuses on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.

### Reward-Conditioned Policies

- Computer Science, ArXiv
- 2019

This work shows how non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory, and how this approach can be derived as a principled method for policy search.

### Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning

- Computer Science, CoRL
- 2019

This work simplifies the long-horizon policy learning problem by using a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, where the low-level only acts for a fixed number of steps, regardless of the goal achieved.