Corpus ID: 54457299

# Off-Policy Deep Reinforcement Learning without Exploration

@inproceedings{Fujimoto2019OffPolicyDR,
title={Off-Policy Deep Reinforcement Learning without Exploration},
author={Scott Fujimoto and David Meger and Doina Precup},
booktitle={ICML},
year={2019}
}
• Published in ICML 2019
• Computer Science, Mathematics
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. [...] Key Method: We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement…
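The idea of restricting the action space to what the batch supports can be illustrated in a small tabular setting. The following is a minimal sketch, not the authors' implementation: it assumes discrete states and actions, transitions stored as `(state, action, reward, next_state, done)` tuples, and hypothetical function names chosen for illustration. The bootstrap maximum ranges only over actions that appear in the batch for the next state. The deep continuous-control algorithm in the paper is more involved and is not reproduced here.

```python
from collections import defaultdict


def batch_constrained_q_learning(batch, gamma=0.99, alpha=0.1, sweeps=200):
    """Tabular Q-learning on a fixed batch (illustrative sketch).

    The max in the bootstrap target is restricted to actions that were
    actually observed in the batch for the next state.
    """
    # Record which actions the batch contains for each state.
    seen_actions = defaultdict(set)
    for s, a, _r, _s_next, _done in batch:
        seen_actions[s].add(a)
    seen_actions = dict(seen_actions)

    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in batch:
            candidates = seen_actions.get(s_next, ())
            if done or not candidates:
                # Terminal state, or no data for the next state: no bootstrap.
                bootstrap = 0.0
            else:
                bootstrap = max(q[(s_next, a2)] for a2 in candidates)
            q[(s, a)] += alpha * (r + gamma * bootstrap - q[(s, a)])
    return q, seen_actions


def greedy_action(q, seen_actions, state):
    """Pick the best action among those the batch contains for `state`.

    Raises KeyError if the batch has no data for `state`.
    """
    return max(seen_actions[state], key=lambda a: q[(state, a)])
```

For example, with a batch such as `[(0, "left", 0.0, 1, False), (0, "right", 0.5, 2, True), (1, "stay", 1.0, 2, True)]` and `gamma=0.9`, the learned policy in state `0` prefers `"left"`, and no value is ever bootstrapped from an action absent from the batch.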
311 Citations


Benchmarking Batch Deep Reinforcement Learning Algorithms
• Computer Science, Mathematics
• ArXiv
• 2019
This paper benchmarks the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy, and finds that many of these algorithms underperform DQN trained online with the same amount of data.
The Least Restriction for Offline Reinforcement Learning
A creative offline RL framework, the Least Restriction (LR), is proposed, which is able to learn robustly from different offline datasets, including random and suboptimal demonstrations, on a range of practical control tasks.
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
This work develops a novel class of off-policy batch RL algorithms, able to effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior, and uses KL-control to penalize divergence from this prior during RL training.
Batch Reinforcement Learning Through Continuation Method
• Computer Science
• ICLR
• 2021
This work proposes a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation, constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories, and continuously relaxing the constraint.
Batch Reinforcement Learning with Hyperparameter Gradients
• Computer Science
• ICML
• 2020
BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
PLAS: Latent Action Space for Offline Reinforcement Learning
• Computer Science
• ArXiv
• 2020
This work proposes to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied, and demonstrates that this method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints.
Offline Reinforcement Learning as Anti-Exploration
This paper designs a new offline RL agent that is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks, instantiated with a bonus based on the prediction error of a variational autoencoder.
BRPO: Batch Residual Policy Optimization
This work derives a new method for RL, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance, and shows that BRPO achieves state-of-the-art performance in a number of tasks.
Provably Good Batch Reinforcement Learning Without Great Exploration
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new…
Provably Good Batch Reinforcement Learning Without Great Exploration
• Computer Science, Mathematics
• ArXiv
• 2020
It is shown that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees on the performance of the output policy, and in certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.

#### References

Batch Reinforcement Learning
• Computer Science
• Reinforcement Learning
• 2012
This chapter introduces the basic principles and theory behind batch reinforcement learning and its most important algorithms, discusses examples of ongoing research within this field, and briefly surveys real-world applications of batch reinforcement learning.
Overcoming Exploration in Reinforcement Learning with Demonstrations
• Computer Science, Mathematics
• 2018 IEEE International Conference on Robotics and Automation (ICRA)
• 2018
This work uses demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm.
Deep Q-learning From Demonstrations
This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process, even from relatively small amounts of demonstration data, and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism.
Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction
• Computer Science
• ICML
• 2017
This work presents two gradient procedures that can learn neural network policies for several problems, including a sequential prediction task and several high-dimensional robotics control problems, and provides a comprehensive theoretical study of IL.
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Residual Policy Learning
• Computer Science, Engineering
• ArXiv
• 2018
It is argued that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.
Benchmarking Deep Reinforcement Learning for Continuous Control
• Computer Science, Mathematics
• ICML
• 2016
This work presents a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure.
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
A general, model-free approach to reinforcement learning on real robotics problems with sparse rewards, built upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations, that outperforms DDPG and does not require engineered rewards.
Safe and Efficient Off-Policy Reinforcement Learning
• Computer Science, Mathematics
• NIPS
• 2016
A novel algorithm, Retrace($\lambda$), is derived, believed to be the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
• Computer Science, Mathematics
• NeurIPS
• 2018
This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.