# Off-Policy Deep Reinforcement Learning without Exploration

@inproceedings{Fujimoto2019OffPolicyDR,
  title     = {Off-Policy Deep Reinforcement Learning without Exploration},
  author    = {Scott Fujimoto and David Meger and Doina Precup},
  booktitle = {ICML},
  year      = {2019}
}

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. [...] Key Method: We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement…
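
The batch-constrained idea in the abstract can be sketched as a simple action-selection rule: instead of maximizing Q over the whole action space, maximize only over candidate actions that a generative model of the batch considers likely. This is a minimal illustration, not the paper's implementation; `q_value` and the candidate list are stand-ins.

```python
import numpy as np

def batch_constrained_action(state, q_value, candidates):
    """Pick the highest-value action among candidates drawn from a
    generative model trained on the batch, so the agent stays close
    to actions the data actually contains."""
    values = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(values))]

# Toy usage: a 1-D action space where the batch only supports {0.0, 0.4, 0.9}.
q = lambda s, a: -(a - 0.5) ** 2          # hypothetical learned Q-function
best = batch_constrained_action(None, q, [0.0, 0.4, 0.9])
```

The restriction is what removes extrapolation error: Q is never queried at actions the batch gives no evidence about.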


#### 311 Citations

Benchmarking Batch Deep Reinforcement Learning Algorithms

- Computer Science, Mathematics
- ArXiv
- 2019

This paper benchmarks the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy, and finds that many of these algorithms underperform DQN trained online with the same amount of data.

The Least Restriction for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2021

A creative offline RL framework, the Least Restriction (LR), is proposed, which is able to learn robustly from different offline datasets, including random and suboptimal demonstrations, on a range of practical control tasks.

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

- Computer Science, Mathematics
- ArXiv
- 2019

This work develops a novel class of off-policy batch RL algorithms, able to effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior, and uses KL-control to penalize divergence from this prior during RL training.
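
The KL-control idea described above can be sketched as a per-step reward penalty on the log-ratio between the learned policy and the pretrained prior; the function and parameter names here are illustrative, not that paper's API.

```python
def kl_penalized_reward(reward, logp_policy, logp_prior, alpha=0.1):
    """Subtract a KL-control penalty: the per-action log-ratio between the
    policy and a prior pretrained on the batch (a common single-sample
    estimator of the KL divergence)."""
    return reward - alpha * (logp_policy - logp_prior)

# If the policy assigns much higher log-probability to its action than the
# prior does, the effective reward drops.
shaped = kl_penalized_reward(1.0, logp_policy=-0.5, logp_prior=-2.5)
```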

Batch Reinforcement Learning Through Continuation Method

- Computer Science
- ICLR
- 2021

This work proposes a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation, constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint.

Batch Reinforcement Learning with Hyperparameter Gradients

- Computer Science
- ICML
- 2020

BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance to the trade-off between adhering to the data collection policy and pursuing the possible policy improvement.

PLAS: Latent Action Space for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2020

This work proposes to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied, and demonstrates that this method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints.
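
The latent-action idea can be sketched as follows: the policy outputs a point in the bounded latent space of a generative model trained on the batch, and the model's decoder maps it back to an action, which keeps chosen actions in-distribution by construction. All helper names below are illustrative stand-ins for the learned networks.

```python
import numpy as np

def plas_action(state, latent_policy, decoder, z_max=2.0):
    """Act in the latent space of a generative model of the batch; clipping
    to a bounded region and decoding keeps the action in-distribution."""
    z = np.clip(latent_policy(state), -z_max, z_max)
    return decoder(state, z)

# Toy usage with stand-in networks: the raw latent output 5.0 is clipped
# to the latent bound 2.0 before decoding.
action = plas_action(1.0, latent_policy=lambda s: 5.0, decoder=lambda s, z: s + z)
```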

Offline Reinforcement Learning as Anti-Exploration

- Computer Science
- ArXiv
- 2021

This paper designs a new offline RL agent that is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks, instantiated with a bonus based on the prediction error of a variational autoencoder.

BRPO: Batch Residual Policy Optimization

- Computer Science, Mathematics
- ArXiv
- 2020

This work derives a new batch RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance, and shows that BRPO achieves state-of-the-art performance in a number of tasks.

Provably Good Batch Reinforcement Learning Without Great Exploration

- Computer Science, Mathematics
- ArXiv
- 2020

It is shown that a small modification to the Bellman optimality and evaluation back-ups to take a more conservative update can have much stronger guarantees on the performance of the output policy, and in certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.
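
The conservative back-up described above can be illustrated with a count-based filter: bootstrap only from next-state actions that the batch supports well, and fall back to a pessimistic value otherwise. The threshold and penalty below are illustrative stand-ins for the paper's pessimism mechanism.

```python
def conservative_backup(reward, gamma, next_q, next_counts,
                        threshold=5, pessimistic_value=-100.0):
    """Bellman backup restricted to well-supported next actions: actions seen
    fewer than `threshold` times in the batch are excluded, and if none
    qualify the target falls back to a pessimistic value."""
    supported = [q for q, n in zip(next_q, next_counts) if n >= threshold]
    bootstrap = max(supported) if supported else pessimistic_value
    return reward + gamma * bootstrap

# The apparently best next action (Q = 10) appears only once in the batch,
# so the backup bootstraps from the well-supported action instead.
target = conservative_backup(1.0, 0.5, next_q=[2.0, 10.0], next_counts=[8, 1])
```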

#### References

SHOWING 1-10 OF 92 REFERENCES

Batch Reinforcement Learning

- Computer Science
- Reinforcement Learning
- 2012

This chapter introduces the basic principles and the theory behind batch reinforcement learning, presents the most important algorithms, discusses ongoing research within this field, and briefly surveys real-world applications of batch reinforcement learning.

Overcoming Exploration in Reinforcement Learning with Demonstrations

- Computer Science, Mathematics
- 2018 IEEE International Conference on Robotics and Automation (ICRA)
- 2018

This work uses demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm.

Deep Q-learning From Demonstrations

- Computer Science
- AAAI
- 2018

This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages relatively small sets of demonstration data to massively accelerate the learning process, and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism.

Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction

- Computer Science
- ICML
- 2017

This work presents two gradient procedures that can learn neural network policies for several problems, including a sequential prediction task and several high-dimensional robotics control problems, and provides a comprehensive theoretical study of IL.

Continuous control with deep reinforcement learning

- Computer Science, Mathematics
- ICLR
- 2016

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

Residual Policy Learning

- Computer Science, Engineering
- ArXiv
- 2018

It is argued that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.

Benchmarking Deep Reinforcement Learning for Continuous Control

- Computer Science, Mathematics
- ICML
- 2016

This work presents a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure.

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

- Computer Science
- ArXiv
- 2017

A general and model-free approach for reinforcement learning on real robotics with sparse rewards, built upon the Deep Deterministic Policy Gradient algorithm to use demonstrations, that outperforms DDPG and does not require engineered rewards.

Safe and Efficient Off-Policy Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2016

A novel algorithm, Retrace($\lambda$), is derived, believed to be the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
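
The Retrace($\lambda$) correction can be sketched for a single off-policy trajectory: TD errors are accumulated with truncated importance weights $c_t = \lambda \min(1, \pi(a_t|s_t)/\mu(a_t|s_t))$, which keeps the update safe under arbitrary behavior policies while using full returns where the policies agree. The list-based interface below is an illustrative simplification.

```python
def retrace_target(rewards, q_sa, v_next, ratios, gamma=0.99, lam=1.0):
    """Return-based off-policy target for Q(s_0, a_0).

    rewards[t], q_sa[t] = Q(s_t, a_t), v_next[t] = E_pi Q(s_{t+1}, .),
    ratios[t] = pi(a_t|s_t) / mu(a_t|s_t) along a trajectory from mu.
    """
    total, coeff = 0.0, 1.0
    for t, (r, q, v, rho) in enumerate(zip(rewards, q_sa, v_next, ratios)):
        if t > 0:
            coeff *= lam * min(1.0, rho)                    # truncated weight c_t
        total += (gamma ** t) * coeff * (r + gamma * v - q)  # TD error delta_t
    return q_sa[0] + total

# Two-step example with gamma = 0.5: the ratio 2.0 at t = 1 is truncated to 1.
target = retrace_target([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 2.0], gamma=0.5)
```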

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

- Computer Science, Mathematics
- NeurIPS
- 2018

This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.