# RvS: What is Essential for Offline RL via Supervised Learning?

@article{Emmons2022RvSWI, title={RvS: What is Essential for Offline RL via Supervised Learning?}, author={Scott Emmons and Benjamin Eysenbach and Ilya Kostrikov and Sergey Levine}, journal={ArXiv}, year={2022}, volume={abs/2112.10751} }

Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex…
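As a rough sketch of what "simply maximizing likelihood" can look like in this setting, the objective below writes conditional behavior cloning as a log-likelihood over dataset trajectories. The notation is assumed for illustration and does not come from the abstract: $\pi_\theta$ is the policy, $\mathcal{D}$ the offline dataset, $\omega$ an outcome such as a goal state or a return, and $f$ a distribution over outcomes that actually occurred later in the same trajectory.

```latex
\max_{\theta} \; \sum_{\tau \in \mathcal{D}} \; \sum_{t=1}^{|\tau|}
  \mathbb{E}_{\omega \sim f(\omega \mid \tau_{t:H})}
  \left[ \log \pi_{\theta}(a_t \mid s_t, \omega) \right]
```

At test time the same policy would be queried with a desired outcome $\omega$ in place of a relabeled one.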

## 32 Citations

### When does return-conditioned supervised learning work for offline reinforcement learning?

- Computer Science
- ArXiv
- 2022

It is shown that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms.

### Dichotomy of Control: Separating What You Can Control from What You Cannot

- Computer Science
- ArXiv
- 2022

The dichotomy of control is proposed, a future-conditioned supervised learning framework that separates mechanisms within a policy’s control (actions) from those beyond a policy's control (environment stochasticity) by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment.

### ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement Learning

- Computer Science
- 2022

ConserWeightive Behavioral Cloning (CWBC) is proposed, a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization.

### You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments

- Computer Science
- ArXiv
- 2022

The proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. This allows ESPER to achieve strong alignment between target return and expected performance in real environments.

### Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories

- Computer Science
- 2022

A simple meta-algorithmic pipeline is developed that learns an inverse-dynamics model on the labelled data to obtain proxy labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories.

### Contrastive Learning as Goal-Conditioned Reinforcement Learning

- Computer Science
- ArXiv
- 2022

This paper builds upon prior work and applies contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function.

### Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

- Computer Science
- ArXiv
- 2022

Q-learning Decision Transformer (QDT) is proposed, which addresses the shortcomings of DT by leveraging the benefits of dynamic programming (Q-learning), so that the two approaches compensate for each other's shortcomings and achieve better performance.

### A Probabilistic Perspective on Reinforcement Learning via Supervised Learning

- Computer Science
- 2022

A novel algorithm is introduced, Implicit RvS, leveraging powerful density estimation techniques that can easily be tilted to produce desirable behaviors, and is compared to a suite of RvS algorithms on the D4RL benchmark.

### All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL

- Computer Science
- ArXiv
- 2022

This work shows that this single UDRL algorithm can also work in the imitation learning and offline RL settings, and can be extended to the goal-conditioned RL setting and even the meta-RL setting.

### Implicit Offline Reinforcement Learning via Supervised Learning

- Computer Science
- 2022

It is shown how implicit models can leverage return information and match or outperform explicit algorithms in acquiring robotic skills from fixed datasets. The close relationship between these implicit methods and other popular RL via Supervised Learning algorithms is also shown, providing a unified framework.

## References

### D4RL: Datasets for Deep Data-Driven Reinforcement Learning

- Computer Science
- ArXiv
- 2020

This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.

### A Minimalist Approach to Offline Reinforcement Learning

- Computer Science
- NeurIPS
- 2021

It is shown that the performance of state-of-the-art offline RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data, and the resulting algorithm is a baseline that is simple to implement and tune.
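One way to write the two modifications described in this summary is sketched below. The notation is assumed for illustration: $\pi$ is a deterministic policy, $Q$ a learned critic, $\mathcal{D}$ a dataset of $N$ state-action pairs, $\alpha$ a weighting hyperparameter, and $\mu$, $\sigma$ the per-feature mean and standard deviation of the dataset states, with $\epsilon$ a small constant.

```latex
\pi = \arg\max_{\pi} \;
  \mathbb{E}_{(s, a) \sim \mathcal{D}}
  \left[ \lambda \, Q(s, \pi(s)) - \left( \pi(s) - a \right)^{2} \right],
\qquad
\lambda = \frac{\alpha}{\frac{1}{N} \sum_{(s_i, a_i)} \left| Q(s_i, a_i) \right|}

s_i \leftarrow \frac{s_i - \mu}{\sigma + \epsilon}
```

The squared-error term is the behavior cloning penalty, and $\lambda$ rescales the critic term so that the balance between the two does not depend on the magnitude of $Q$.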

### Conservative Q-Learning for Offline Reinforcement Learning

- Computer Science
- NeurIPS
- 2020

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.

### MOReL: Model-Based Offline Reinforcement Learning

- Computer Science
- NeurIPS
- 2020

Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.

### Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

- Computer Science
- ICML
- 2021

Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly, is proposed, and it is observed that UWAC substantially improves model stability during training.

### Overcoming Model Bias for Robust Offline Deep Reinforcement Learning

- Computer Science
- Eng. Appl. Artif. Intell.
- 2021

### MOPO: Model-based Offline Policy Optimization

- Computer Science
- NeurIPS
- 2020

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks.

### Hyperparameter Selection for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2020

This work focuses on offline hyperparameter selection, i.e., methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.

### Learning to Reach Goals via Iterated Supervised Learning

- Computer Science
- ICLR
- 2021

This paper proposes a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. It formally shows that this iterated supervised learning procedure optimizes a bound on the RL objective, derives performance bounds for the learned policy, and empirically demonstrates improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks.
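The relabeling step mentioned in this summary can be illustrated with a minimal sketch. The function name, the tuple-based state type, and the choice to enumerate every later state as a goal (rather than sampling one) are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple

State = Tuple[int, ...]
Action = int
Trajectory = List[Tuple[State, Action]]  # (state, action) pairs in time order


def relabel_with_reached_goals(
    trajectory: Trajectory, final_state: State
) -> List[Tuple[State, State, Action]]:
    """Turn one trajectory into (state, goal, action) supervised examples.

    Every state reached after step t is treated as a goal for the action
    taken at step t, so each action is a valid demonstration in hindsight.
    """
    # States visited after each action: s_1, ..., s_T.
    visited = [state for state, _ in trajectory[1:]] + [final_state]
    examples = []
    for t, (state, action) in enumerate(trajectory):
        for goal in visited[t:]:
            examples.append((state, goal, action))
    return examples


# Hypothetical two-step trajectory on a line: 0 -> 1 -> 2, always action 1.
demo = relabel_with_reached_goals([((0,), 1), ((1,), 1)], final_state=(2,))
```

The resulting triples can be fed to any supervised learner that predicts the action from the state and goal. Imitating them and then collecting new trajectories with the updated policy gives the iterated loop.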

### Reward-Conditioned Policies

- Computer Science
- ArXiv
- 2019

This work shows how non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory, and how this approach can be derived as a principled method for policy search.
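To make the idea of "matching the reward of the given trajectory" concrete, the sketch below builds supervised examples in which each action is paired with the return that actually followed it. The function name, the scalar state type, and the use of an undiscounted return-to-go as the conditioning variable are assumptions for illustration only.

```python
from typing import List, Tuple


def return_conditioned_examples(
    states: List[float], actions: List[int], rewards: List[float]
) -> List[Tuple[float, float, int]]:
    """Build (state, return_to_go, action) triples from one trajectory."""
    if not (len(states) == len(actions) == len(rewards)):
        raise ValueError("states, actions, and rewards must have equal length")
    examples = []
    return_to_go = 0.0
    # Walk backwards so each step's target is the sum of rewards from t onward.
    for t in reversed(range(len(rewards))):
        return_to_go += rewards[t]
        examples.append((states[t], return_to_go, actions[t]))
    examples.reverse()  # restore time order
    return examples


# Hypothetical three-step trajectory.
triples = return_conditioned_examples([0.0, 1.0, 2.0], [1, 0, 1], [1.0, 0.0, 2.0])
```

A policy trained on such triples sees even a low-return trajectory as correct supervision for achieving that low return. Conditioning on a high target at test time is what asks for better behavior.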