Corpus ID: 219708452

Accelerating Online Reinforcement Learning with Offline Datasets

@article{Nair2020AcceleratingOR,
  title={Accelerating Online Reinforcement Learning with Offline Datasets},
  author={Ashvin Nair and Murtaza Dalal and Abhishek Gupta and Sergey Levine},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.09359}
}
Reinforcement learning provides an appealing formalism for learning control policies from experience. However, the classic active formulation of reinforcement learning necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings. If we can instead allow reinforcement learning to effectively use previously collected data to aid the online learning process, where the data could be expert demonstrations or more generally any prior…
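To make the idea concrete, here is a minimal, hedged sketch of the kind of off-policy update such methods build on: logged actions are imitated with weights that grow with their estimated advantage, so the same update can consume prior data and fresh online experience. The networks, dimensions, temperature lam, and weight clipping below are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

# Toy dimensions and temperature; all of these are illustrative assumptions.
obs_dim, act_dim, lam = 8, 2, 1.0

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def advantage_weighted_update(obs, act):
    """Imitate logged actions, weighted by exp(advantage / lam)."""
    with torch.no_grad():
        adv = q_net(torch.cat([obs, act], dim=-1)) - v_net(obs)
        weight = torch.exp(adv / lam).clamp(max=20.0)  # clip weights for stability
    mean = policy(obs)
    # Log-likelihood of the logged action under a unit-variance Gaussian policy.
    log_prob = -0.5 * ((act - mean) ** 2).sum(dim=-1, keepdim=True)
    loss = -(weight * log_prob).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# A random batch standing in for samples from a mixed offline/online replay buffer.
advantage_weighted_update(torch.randn(32, obs_dim), torch.randn(32, act_dim))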
A Workflow for Offline Model-Free Robotic Reinforcement Learning
TLDR: This paper develops a practical workflow for using offline RL, analogous to the relatively well-understood workflows for supervised learning problems, and devises a set of metrics and conditions that can be tracked over the course of offline training to inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance.
Offline Meta-Reinforcement Learning with Online Self-Supervision
TLDR: A hybrid offline meta-RL algorithm is proposed, which uses offline data with rewards to meta-train an adaptive policy and then collects additional unsupervised online data, without any ground-truth reward labels, to bridge the distribution shift between offline meta-training and online adaptation.
Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble
TLDR: This paper proposes a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset, and leverages multiple Q-functions trained pessimistically offline, thereby preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase.
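As a sketch of the pessimistic half of this recipe, the snippet below forms TD targets from the minimum over a small Q-ensemble, which keeps value estimates conservative at unfamiliar actions; the ensemble size, network shapes, and discount factor are assumptions.

import torch
import torch.nn as nn

# Assumed ensemble size, shapes, and discount factor.
obs_dim, act_dim, n_ensemble, gamma = 8, 2, 4, 0.99

q_ensemble = [nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
              for _ in range(n_ensemble)]

def pessimistic_td_target(rew, next_obs, next_act, done):
    """TD target built from the minimum over the Q-ensemble (conservative at unfamiliar actions)."""
    with torch.no_grad():
        inp = torch.cat([next_obs, next_act], dim=-1)
        q_next = torch.stack([q(inp) for q in q_ensemble], dim=0)  # [n_ensemble, B, 1]
        return rew + gamma * (1.0 - done) * q_next.min(dim=0).values

# Random tensors standing in for a batch drawn from the (balanced) replay buffer.
B = 32
target = pessimistic_td_target(torch.randn(B, 1), torch.randn(B, obs_dim),
                               torch.randn(B, act_dim), torch.zeros(B, 1))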
The Difficulty of Passive Learning in Deep Reinforcement Learning
TLDR: This work proposes the "tandem learning" experimental paradigm and identifies function approximation in conjunction with fixed data distributions as the strongest factors behind the difficulty of passive learning, thereby extending but also challenging hypotheses stated in past work.
Addressing Distribution Shift in Online Reinforcement Learning with Offline Datasets
TLDR: A simple yet effective framework is proposed that incorporates a balanced replay scheme and an ensemble distillation scheme, improving the policy using the Q-ensemble during fine-tuning so that policy updates are more robust to error in each individual Q-function.
Offline Reinforcement Learning with Implicit Q-Learning
TLDR: This work proposes a new offline RL method, called implicit Q-learning (IQL), that never needs to evaluate actions outside of the dataset but still enables the learned policy to improve substantially over the best behavior in the data through generalization.
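The asymmetric (expectile) regression that lets a value function approach the best in-dataset action values without ever querying out-of-dataset actions can be sketched as follows; the expectile tau and the toy tensors are assumptions.

import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared loss: positive residuals (Q above V) are weighted by tau."""
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

# Toy tensors standing in for Q(s, a) evaluated on dataset actions and for V(s).
q = torch.randn(32, 1)
v = torch.randn(32, 1, requires_grad=True)
expectile_loss(q, v).backward()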
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
TLDR: Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly, is proposed, and UWAC is observed to substantially improve model stability during training.
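A hedged illustration of the down-weighting idea: estimate the spread of target Q-values with a few stochastic (dropout) forward passes and scale each transition's contribution by a decreasing function of that spread. The dropout estimator and the specific weighting rule below are illustrative choices, not necessarily UWAC's exact ones.

import torch
import torch.nn as nn

obs_dim, act_dim, beta = 8, 2, 1.0  # beta controls the assumed weighting scale

# A Q-network with dropout so repeated forward passes give different estimates.
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                      nn.Dropout(p=0.1), nn.Linear(64, 1))

def uncertainty_weights(next_obs, next_act, n_samples=8):
    """Down-weight transitions whose target Q-value estimate has high variance."""
    q_net.train()  # keep dropout active during the repeated passes
    inp = torch.cat([next_obs, next_act], dim=-1)
    with torch.no_grad():
        samples = torch.stack([q_net(inp) for _ in range(n_samples)], dim=0)
    var = samples.var(dim=0)
    return torch.clamp(beta / (var + 1e-6), max=1.0)  # illustrative weighting rule

B = 32
w = uncertainty_weights(torch.randn(B, obs_dim), torch.randn(B, act_dim))
# A critic loss would then use (w * td_error.pow(2)).mean() instead of an unweighted mean.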
Offline Reinforcement Learning with Value-based Episodic Memory
TLDR: This paper adopts a different framework, which learns the V-function instead of the Q-function to naturally keep the learning procedure within the support of an offline dataset, and proposes Expectile V-Learning (EVL), which smoothly interpolates between optimal value learning and behavior cloning.
Offline Inverse Reinforcement Learning
The objective of offline RL is to learn optimal policies when a fixed exploratory demonstration dataset is available and sampling additional observations is impossible (typically if this operation…
Uncertainty Weighted Offline Reinforcement Learning (Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2020)
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based…

References

Showing 1-10 of 54 references
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
TLDR: This work develops a novel class of off-policy batch RL algorithms, able to effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior, and uses KL-control to penalize divergence from this prior during RL training.
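A minimal sketch of the KL-control ingredient, assuming Gaussian policy and prior distributions and an arbitrary penalty coefficient: the actor loss trades off the critic's value against KL divergence from a fixed prior standing in for a model pre-trained on the batch data.

import torch
import torch.distributions as D

alpha = 0.1  # assumed strength of the KL penalty

def kl_regularized_policy_loss(q_of_policy_actions, policy_dist, prior_dist):
    """Maximize the critic's value while penalizing KL divergence from a fixed prior."""
    kl = D.kl_divergence(policy_dist, prior_dist)  # [B]
    return (-q_of_policy_actions.squeeze(-1) + alpha * kl).mean()

# Gaussian stand-ins: the prior plays the role of a model pre-trained on the batch data.
B, act_dim = 32, 2
policy_loc = torch.randn(B, act_dim, requires_grad=True)
policy = D.Independent(D.Normal(policy_loc, torch.ones(B, act_dim)), 1)
prior = D.Independent(D.Normal(torch.zeros(B, act_dim), torch.ones(B, act_dim)), 1)
kl_regularized_policy_loss(torch.randn(B, 1), policy, prior).backward()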
Off-Policy Deep Reinforcement Learning without Exploration
TLDR: This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
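One way to picture the batch constraint, with a toy Gaussian behavior model and assumed dimensions standing in for the method's learned generative model: sample a few candidate actions near what the data-generating policy would do and take the best of those under Q, rather than maximizing Q over all actions.

import torch
import torch.nn as nn

obs_dim, act_dim, n_candidates = 8, 2, 10  # assumed sizes

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Trivial stand-in for a learned model of which actions appear in the batch.
behavior_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

def batch_constrained_action(obs):
    """Choose the highest-Q action among candidates sampled near the behavior model's output."""
    with torch.no_grad():
        mean = behavior_mean(obs)                                  # [B, act_dim]
        noise = 0.1 * torch.randn(obs.shape[0], n_candidates, act_dim)
        cand = mean.unsqueeze(1) + noise                           # [B, n_candidates, act_dim]
        obs_rep = obs.unsqueeze(1).expand(-1, n_candidates, -1)
        q = q_net(torch.cat([obs_rep, cand], dim=-1)).squeeze(-1)  # [B, n_candidates]
        best = q.argmax(dim=1)
        return cand[torch.arange(obs.shape[0]), best]              # [B, act_dim]

actions = batch_constrained_action(torch.randn(32, obs_dim))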
Behavior Regularized Offline Reinforcement Learning
TLDR: A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
Reinforcement Learning from Imperfect Demonstrations
TLDR: This work proposes a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data, making NAC robust to suboptimal demonstration data.
Exponentially Weighted Imitation Learning for Batched Historical Data
TLDR: A monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation, works well with hybrid (discrete and continuous) action spaces, and can be used to learn from data generated by an unknown policy.
Overcoming Exploration in Reinforcement Learning with Demonstrations
TLDR: This work uses demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control, such as stacking blocks with a robot arm.
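A hedged sketch of one way demonstrations are commonly folded into an actor-critic update in this line of work: a behavior-cloning term on demonstration actions, masked so it only applies where the critic still prefers the demonstrated action over the policy's own. The names, shapes, and masking rule are assumptions rather than this paper's exact recipe.

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # assumed sizes

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def bc_loss_with_q_filter(demo_obs, demo_act):
    """Imitate demonstration actions only where the critic still prefers them to the policy's."""
    pi_act = actor(demo_obs)
    with torch.no_grad():
        q_demo = critic(torch.cat([demo_obs, demo_act], dim=-1))
        q_pi = critic(torch.cat([demo_obs, pi_act], dim=-1))
        mask = (q_demo > q_pi).float()
    return (mask * (pi_act - demo_act).pow(2).sum(dim=-1, keepdim=True)).mean()

loss = bc_loss_with_q_filter(torch.randn(16, obs_dim), torch.randn(16, act_dim))
# In a full agent this term would be added to the usual actor loss, e.g. -Q(s, pi(s)).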
Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning
TLDR: This work simplifies the long-horizon policy learning problem by using a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, where the low-level policy only acts for a fixed number of steps, regardless of the goal achieved.
Batch Reinforcement Learning
TLDR: This chapter introduces the basic principles and theory behind batch reinforcement learning and its most important algorithms, discusses ongoing research within this field by way of examples, and briefly surveys real-world applications of batch reinforcement learning.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
TLDR: A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed, and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
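The support constraint behind this approach can be sketched, under an assumed RBF kernel and bandwidth, as a maximum mean discrepancy (MMD) between actions sampled from the learned policy and actions from the batch, which an actor update can then penalize or constrain.

import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Simple (biased) RBF-kernel estimate of MMD^2 between two sample sets [N, d] and [M, d]."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Toy samples standing in for policy actions and dataset actions at the same states.
policy_actions = torch.randn(8, 2, requires_grad=True)
data_actions = torch.randn(8, 2)
penalty = rbf_mmd2(policy_actions, data_actions)
penalty.backward()  # could enter the actor loss via a penalty or Lagrange multiplier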
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
TLDR: A general and model-free approach for reinforcement learning on real robots with sparse rewards, built upon the Deep Deterministic Policy Gradient (DDPG) algorithm and using demonstrations; it outperforms DDPG and does not require engineered rewards.