Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones

@article{Thananjeyan2021RecoveryRS,
  title={Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones},
  author={Brijen Thananjeyan and Ashwin Balakrishna and Suraj Nair and Michael Luo and Krishna Parasuram Srinivasan and Minho Hwang and Joseph E. Gonzalez and Julian Ibarz and Chelsea Finn and Ken Goldberg},
  journal={IEEE Robotics and Automation Letters},
  year={2021},
  volume={6},
  pages={4915-4922}
}
Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration. We propose Recovery RL, an algorithm which navigates this tradeoff by (1) leveraging offline data to learn about constraint violating zones before policy learning and (2) separating the goals of improving task performance and constraint satisfaction across two policies: a task policy that only… 
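The abstract describes a simple division of labor: a task policy proposes actions, and a recovery policy takes over whenever a learned safety critic predicts that the proposed action leads into a constraint-violating (recovery) zone. Below is a minimal Python sketch of that action-selection rule; it is an illustrative reading of the abstract rather than the authors' implementation, and the names task_policy, recovery_policy, q_risk, and the threshold eps_risk are assumptions.

def select_action(state, task_policy, recovery_policy, q_risk, eps_risk=0.3):
    """Pick the task policy's action unless the safety critic deems it too risky.

    q_risk(state, action) is assumed to estimate the probability of a future
    constraint violation, pretrained on offline data of violations (per the
    abstract) and refined during policy learning.
    """
    a_task = task_policy(state)
    if q_risk(state, a_task) > eps_risk:
        # Proposed action falls in the learned recovery zone: defer to the
        # recovery policy, which is trained to drive q_risk back down.
        return recovery_policy(state)
    return a_task

Separating the two objectives this way lets the task policy optimize reward without a constraint penalty, while constraint satisfaction is handled entirely by the safety critic and recovery policy.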
Improving Safety in Deep Reinforcement Learning using Unsupervised Action Planning
TLDR
This work proposes a novel technique of unsupervised action planning to improve the safety of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) or proximal policy optimization (PPO).
Towards Safe Reinforcement Learning with a Safety Editor Policy
TLDR
This work proposes to separately learn a safety editor policy that transforms potentially unsafe actions output by a utility maximizer policy into safe ones, and demonstrates outstanding utility performance while complying with the constraints.
Safe Reinforcement Learning Using Advantage-Based Intervention
TLDR
This work proposes a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training and optimizes the agent’s policy using off-the-shelf RL algorithms designed for unconstrained MDPs.
SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition
TLDR
This work theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms baseline methods in learning successful policies and enforcing safety.
Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies
TLDR
An iterative policy optimization algorithm is proposed that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set.
LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Iterative Tasks
TLDR
Latent Space Safe Sets is presented, which extends the safe-set strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned safe set where task completion is likely.
Safe Model-Based Reinforcement Learning Using Robust Control Barrier Functions
TLDR
This paper frames safety as a differentiable robust-control-barrier-function layer in a model-based RL framework that ensures safety and effectively guides exploration during training resulting in increased sample efficiency as demonstrated in the experiments.
Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations
TLDR
This paper proposes an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies and adds a regularization term to encourage larger certified regions to enable better exploration.
Constrained Variational Policy Optimization for Safe Reinforcement Learning
TLDR
This paper overcomes the issues from a novel probabilistic inference perspective and proposes an Expectation-Maximization style approach to learn a safe policy, showing the unique advantages of constrained variational policy optimization by proving its optimality and policy improvement stability.
MESA: Offline Meta-RL for Safe Adaptation and Fault Tolerance
TLDR
Simulation experiments across 5 continuous control domains suggest that MESA can leverage offline data from a range of different environments to reduce constraint violations in unseen environments by up to a factor of 2 while maintaining task performance.

References

Showing 1-10 of 43 references
Learning to be Safe: Deep RL with a Safety Critic
TLDR
This work proposes to learn how to be safe in one set of tasks and environments, and then use that learned intuition to constrain future behaviors when learning new, modified tasks, and empirically studies this form of safety-constrained transfer learning in three challenging domains.
Benchmarking Safe Exploration in Deep Reinforcement Learning
TLDR
This work proposes to standardize constrained RL as the main formalism for safe exploration, and presents the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL.
Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning
TLDR
This work proposes an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and reset policy, with the reset policy resetting the environment for a subsequent attempt.
Worst Cases Policy Gradients
TLDR
This work proposes an actor-critic framework that models the uncertainty of the future and simultaneously learns a policy based on that uncertainty model, optimizing policies for varying levels of Conditional Value-at-Risk.
Constrained Policy Optimization
TLDR
Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration, and allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training.
Safe Reinforcement Learning via Shielding
TLDR
A new approach is presented to learn optimal policies while enforcing properties expressed in temporal logic, by synthesizing a reactive system called a shield that monitors the actions from the learner and corrects them only if the chosen action would violate the specification.
A Lyapunov-based Approach to Safe Reinforcement Learning
TLDR
This work defines and presents a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints.
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes
TLDR
A novel algorithm is developed and proven able to completely explore the safely reachable part of the MDP without violating the safety constraint; it is demonstrated on digital terrain models for the task of exploring an unknown map with a rover.
Reward Constrained Policy Optimization
TLDR
This work presents a novel multi-timescale approach for constrained policy optimization, called `Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint satisfying one.
Lyapunov-based Safe Policy Optimization for Continuous Control
TLDR
Safe policy optimization algorithms based on a Lyapunov approach are presented for continuous-action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations.