Corpus ID: 239768474

Safely Bridging Offline and Online Reinforcement Learning

@article{Xu2021SafelyBO,
  title={Safely Bridging Offline and Online Reinforcement Learning},
  author={Wanqiao Xu and Kan Xu and Hamsa Bastani and Osbert Bastani},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.13060}
}
A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property—uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. We experimentally validate our results on a sepsis treatment task… 
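
A minimal sketch of the override rule described in the abstract, assuming access to a high-probability pessimistic value estimate and an estimate of the conservative policy's value; the function and variable names below are illustrative, not the paper's.

    # Sketch of one safety-constrained action-selection step (Python).
    # pessimistic_value(s, pi) is assumed to return a high-probability lower
    # bound on the value of following policy pi from state s, and
    # baseline_value(s) an estimate of the conservative policy's value.
    def choose_action(state, budget_remaining, ucb_policy, baseline_policy,
                      pessimistic_value, baseline_value):
        exploratory_action = ucb_policy(state)
        # Worst-case estimate of continuing with the UCB (exploration) policy.
        lower_bound = pessimistic_value(state, ucb_policy)
        # Override with the conservative policy whenever exploration could
        # underperform the baseline by more than the remaining episode budget.
        if lower_bound < baseline_value(state) - budget_remaining:
            return baseline_policy(state)
        return exploratory_action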

Safe Data Collection for Offline and Online Policy Learning

The Safe Phased-Elimination (SafePE) algorithm is developed, which achieves an optimal regret bound with only a logarithmic number of policy updates and is applicable to the safe online learning setting.

Finding Safe Zones of Markov Decision Processes Policies

The main result is a bi-criteria approximation algorithm that achieves a factor of almost 2 approximation for both the escape probability and the SafeZone size, using a sample of polynomial size.

SCOPE: Safe Exploration for Dynamic Computer Systems Optimization

This work evaluates SCOPE’s ability to deliver improved latency while minimizing power-constraint violations, dynamically configuring hardware as it runs a variety of Apache Spark applications.

References

Showing 1-10 of 18 references

Safe Reinforcement Learning via Shielding

This work proposes a new approach to learning optimal policies while enforcing properties expressed in temporal logic, by synthesizing a reactive system called a shield that monitors the actions of the learner and corrects them only if the chosen action would cause a violation of the specification.

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address the limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
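
A rough sketch of the objective behind this idea, with $\mathcal{D}$ the offline dataset, $\mu$ a distribution over actions (e.g., the current policy), $\hat{\mathcal{B}}^{\pi}$ the empirical Bellman operator, and $\alpha > 0$ a trade-off weight (notation approximate):

$$\min_{Q}\;\; \alpha\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu}\big[Q(s,a)\big] \;-\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big) \;+\; \tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]$$

The first term pushes Q-values down on actions sampled from $\mu$ and up on actions actually present in the data, which is what yields the lower bound on the policy's value.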

A comprehensive survey on safe reinforcement learning

This work categorizes and analyzes two approaches to safe reinforcement learning: modifying the optimality criterion (the classic discounted finite/infinite-horizon objective) with a safety factor, and incorporating external knowledge or the guidance of a risk metric.

Conservative Exploration in Reinforcement Learning

This paper introduces the notion of conservative exploration for average-reward and finite-horizon problems, and presents two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning.
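
The conservative constraint referred to here is, roughly, that at every episode the cumulative performance of the policies played so far must stay within a fraction $\alpha$ of what the baseline policy $\pi_b$ would have achieved; a schematic form (not the paper's exact statement) is

$$\sum_{k=1}^{t} V^{\pi_k}(s_{1,k}) \;\ge\; (1-\alpha)\sum_{k=1}^{t} V^{\pi_b}(s_{1,k}) \qquad \text{for all episodes } t,$$

which the algorithms must satisfy with high probability throughout learning.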

MOPO: Model-based Offline Policy Optimization

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks.
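
The penalty itself is simple to state; as a sketch, with $\hat{r}$ the learned reward, $u(s,a)$ an uncertainty estimate for the learned dynamics at $(s,a)$, and $\lambda > 0$ a penalty coefficient, the policy is optimized in the learned model under the modified reward

$$\tilde{r}(s,a) \;=\; \hat{r}(s,a) \;-\; \lambda\, u(s,a),$$

so that the agent is discouraged from exploiting state-action regions where the model is unreliable.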

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection.

Offline policy evaluation across representations with applications to educational games

A data-driven methodology for comparing and validating policies offline, which focuses on the ability of each policy to generalize to new data and applies to a partially-observable, high-dimensional concept sequencing problem in an educational game.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.
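
For context, the improvement is in the dependence on the number of states: the bound established for PSRL is of the form $\tilde{O}(H\sqrt{SAT})$, versus the previous best of $\tilde{O}(H S \sqrt{AT})$ (with $H$ the horizon, $S$ the number of states, $A$ the number of actions, and $T$ the elapsed time).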

Robust Model Predictive Shielding for Safe Reinforcement Learning with Stochastic Dynamics

  • Shuo Li, O. Bastani
  • Computer Science
  • 2020 IEEE International Conference on Robotics and Automation (ICRA)
  • 2020
This work proposes a framework for safe reinforcement learning that can handle stochastic nonlinear dynamical systems, and proposes to use a tube-based robust nonlinear model predictive controller (NMPC) as the backup controller.

Near-optimal Regret Bounds for Reinforcement Learning

This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if, for any pair of states s, s', there is a policy which moves from s to s' in at most D steps on average.
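
Written out, the diameter is defined through expected travel times rather than a worst-case step count: with $T(s' \mid \pi, s)$ the (random) number of steps needed to reach $s'$ from $s$ when following policy $\pi$,

$$D(M) \;=\; \max_{s \neq s'}\; \min_{\pi}\; \mathbb{E}\big[T(s' \mid \pi, s)\big].$$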