Corpus ID: 220127951

Critic Regularized Regression

@article{Wang2020CriticRR,
  title={Critic Regularized Regression},
  author={Ziyu Wang and Alexander Novikov and Konrad Zolna and Jost Tobias Springenberg and Scott E. Reed and Bobak Shahriari and Noah Siegel and Josh Merel and Caglar Gulcehre and Nicolas Manfred Otto Heess and Nando de Freitas},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.15134}
}
Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies… 
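The algorithm sketched in the abstract can be read as behavioural cloning reweighted by a learned critic: dataset actions are imitated only to the extent that the critic judges them advantageous. Below is a minimal NumPy sketch of that idea; the function names, the beta temperature, and the weight clip are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def crr_weights(q_data_action, q_policy_samples, beta=1.0, variant="exp", max_weight=20.0):
    """Per-transition weights derived from a critic's advantage estimate.

    q_data_action:     Q(s, a) for the action stored in the dataset, shape [batch].
    q_policy_samples:  Q(s, a') for actions sampled from the current policy,
                       shape [batch, num_samples]; their mean approximates V(s).
    """
    advantage = q_data_action - q_policy_samples.mean(axis=1)
    if variant == "exp":
        # Exponential weighting, clipped so that rare large advantages do not dominate.
        return np.minimum(np.exp(advantage / beta), max_weight)
    # Binary (indicator) weighting: copy only actions the critic judges advantageous.
    return (advantage > 0).astype(np.float64)

def crr_policy_loss(log_probs, weights):
    """Critic-weighted behavioural cloning: negative weighted log-likelihood
    of the dataset actions under the current policy."""
    return -np.mean(weights * log_probs)

The exponential and indicator variants correspond to softer or harder filtering of the logged actions; either way, the weights simply rescale a standard maximum-likelihood loss, which is what makes the approach stable on fixed datasets.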

Citations

Personalization for Web-based Services using Offline Reinforcement Learning
TLDR
This work addresses the challenges of learning UI policies through model-free offline reinforcement learning (RL) with off-policy training, and significantly improves long-term objectives in a production system for user authentication in a major social network.
The Challenges of Exploration for Offline Reinforcement Learning
TLDR
This work proposes to evaluate the quality of collected data by transferring it and inferring policies with reward relabelling and standard offline RL algorithms; under this scheme, it evaluates a wide variety of data-collection strategies, including a new exploration agent, Intrinsic Model Predictive Control.
Online and Offline Reinforcement Learning by Planning with a Learned Model
TLDR
The Reanalyse algorithm is described, which uses model-based policy and value improvement operators to compute new, improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude.
Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning
TLDR
This work focuses on a series of inference-based actor-critic algorithms – MPO, AWR, and SAC – to decouple their algorithmic innovations and implementation decisions through a single control-as-inference objective, and shows which implementation details are co-adapted and co-evolved with algorithms and which are transferable across algorithms.
Continuous Doubly Constrained Batch Reinforcement Learning
TLDR
An algorithm for batch RL in which effective policies are learned using only a fixed offline dataset instead of online interactions with the environment; it compares favorably to state-of-the-art methods regardless of how the offline data were collected.
How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
TLDR
This work develops two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand, and investigates ways to minimize online interactions in a target task by reusing a suboptimal policy.
COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation
TLDR
This paper presents an offline constrained RL algorithm, COptiDICE, that directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
Semi-supervised reward learning for offline reinforcement learning
TLDR
This work greatly improves upon behavioural cloning and closely approaches the performance achieved with ground-truth rewards, and further investigates the relationship between the quality of the reward model and the final policies.
Offline Learning from Demonstrations and Unlabeled Experience
TLDR
Across a diverse set of continuous control and simulated robotic manipulation tasks, it is shown that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
...

References

Showing 1–10 of 41 references
Off-Policy Deep Reinforcement Learning without Exploration
TLDR
This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
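As a rough illustration of the batch-constrained idea summarized above, the sketch below only ever selects among actions proposed by a generative model fit to the dataset, letting the critic choose between them; behaviour_model and q_net are hypothetical callables, not the paper's actual networks.

import numpy as np

def batch_constrained_action(state, behaviour_model, q_net, num_candidates=10):
    """Choose the best critic-rated action among candidates the data model deems plausible.

    behaviour_model(state, n) -> array of n candidate actions, shape [n, action_dim]
    q_net(state, action)      -> scalar value estimate
    """
    candidates = behaviour_model(state, num_candidates)
    values = np.array([q_net(state, action) for action in candidates])
    # The agent never evaluates or picks actions far outside the dataset's support.
    return candidates[int(np.argmax(values))]
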
Distributional Policy Gradients
In International Conference on Learning Representations, 2018
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
TLDR
This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
Maximum a Posteriori Policy Optimisation
TLDR
This work introduces a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative entropy objective, and develops two off-policy algorithms that are competitive with the state of the art in deep reinforcement learning.
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
TLDR
A simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines and is able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions is developed.
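The supervised-learning subroutine mentioned above amounts to regression onto dataset actions with exponentiated-advantage weights. The short sketch below illustrates this under assumed names (baseline_values, beta, max_weight); it is closely related to the exponential weighting sketched earlier for CRR.

import numpy as np

def awr_weights(returns, baseline_values, beta=1.0, max_weight=20.0):
    """exp((R - V(s)) / beta), clipped to keep the supervised regression stable."""
    advantages = returns - baseline_values
    return np.minimum(np.exp(advantages / beta), max_weight)

def awr_policy_loss(log_probs, weights):
    """Weighted maximum-likelihood (negative log-likelihood) objective over dataset actions."""
    return -np.mean(weights * log_probs)
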
Fitted Q-iteration by Advantage Weighted Regression
TLDR
It is shown that by using a soft-greedy action selection, the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression, making it possible to derive a new, computationally efficient FQI algorithm that can even deal with high-dimensional action spaces.
RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning
TLDR
This paper proposes RL Unplugged, a suite of benchmarks for evaluating and comparing offline RL methods, intended to increase the reproducibility of experiments and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community.
Accelerating Online Reinforcement Learning with Offline Datasets
TLDR
A novel algorithm is proposed that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of reinforcement learning policies.
Acme: A Research Framework for Distributed Reinforcement Learning
TLDR
It is shown that the design decisions behind Acme lead to agents that can be scaled both up and down and that, for the most part, greater levels of parallelization result in agents with equivalent performance, just faster.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection.
...