Corpus ID: 219966528

Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies

@article{Yang2020AcceleratingSR,
  title={Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies},
  author={Tsung-Yen Yang and Justinian P. Rosca and Karthik Narasimhan and Peter J. Ramadge},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.11645}
}
We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data, or even a heuristic, while the constraints might encode safety, fairness, or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand and is not guaranteed to satisfy the specified constraints. The key…
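As a rough formalization (a sketch of the standard constrained-RL objective this setup corresponds to, in common notation rather than the paper's own):

\[
\max_{\pi}\; J_R(\pi)=\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\gamma^{t} r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
J_{C_i}(\pi)=\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\gamma^{t} c_i(s_t,a_t)\Big]\le d_i,\;\; i=1,\dots,m,
\]

where the baseline policy \(\pi_0\) is available as a starting point even though \(\pi_0\) itself need not maximize \(J_R\) or satisfy \(J_{C_i}(\pi_0)\le d_i\).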

Safe Reinforcement Learning with Natural Language Constraints

TLDR
This paper develops a model that contains a constraint interpreter to encode natural language constraints into vector representations capturing spatial and temporal information on forbidden states, and a policy network that uses these representations to output a policy with minimal constraint violations.

Guided Safe Shooting: model based reinforcement learning with safety constraints

TLDR
This paper introduces Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of the safety constraints, and proposes three different safe planners: one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm.
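The random-shooting variant can be pictured with a short sketch. The snippet below is a minimal illustration of safe random shooting under a learned model; the model, reward_fn, and cost_fn callables are hypothetical placeholders, and it omits the MAP-Elites planners entirely, so it is not GuSS itself.

# Hedged sketch: safe random-shooting planning under a learned dynamics model.
import numpy as np

def safe_random_shooting(state, model, reward_fn, cost_fn,
                         horizon=10, n_samples=200, action_dim=2, rng=None):
    rng = rng or np.random.default_rng()
    best_action, best_return = None, -np.inf
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total_reward, violated = state, 0.0, False
        for a in actions:
            s = model(s, a)                  # learned one-step dynamics
            total_reward += reward_fn(s, a)
            if cost_fn(s, a) > 0.0:          # predicted safety violation
                violated = True
                break
        if not violated and total_reward > best_return:
            best_return, best_action = total_reward, actions[0]
    return best_action  # None if no predicted-safe action sequence was found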

CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

TLDR
This paper proposes CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee based on the new proposed performance bounds and surrogate functions, and provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation.

SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition

TLDR
This work theoretically characterizes why SAFER can enforce safe policy learning and demonstrates its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms state-of-the-art primitive learning methods in success and safety.

Improving Safety in Deep Reinforcement Learning using Unsupervised Action Planning

TLDR
This work proposes a novel technique of unsupervised action planning to improve the safety of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) or proximal policy optimization (PPO).

Safe Reinforcement Learning by Imagining the Near Future

TLDR
This work devises a model-based algorithm that heavily penalizes unsafe trajectories, derives guarantees that the algorithm can avoid unsafe states under certain assumptions, and demonstrates that it achieves competitive rewards with fewer safety violations on several continuous control tasks.
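One hedged way to read the "heavy penalty" is as reward shaping of the form

\[
\tilde{r}(s,a) = r(s,a) - C\cdot\mathbf{1}\big[s\in\mathcal{S}_{\text{unsafe}}\big],
\]

with \(C\) chosen large enough that, within the model-based rollouts, any imagined trajectory reaching an unsafe state is dominated by trajectories that stay safe; the precise objective and guarantees are the paper's, not this simplification.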

Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations

TLDR
This paper proposes an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies and adds a regularization term to encourage larger certified regions to enable better exploration.
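For context, a discrete-time barrier certificate \(h\) of the kind CRABS learns can be sketched (standard formulation, simplified) as a function satisfying

\[
h(s_0)\ge 0,\qquad h(s)<0 \ \text{ for all unsafe } s,\qquad h(s)\ge 0 \;\Rightarrow\; h\big(\hat f(s,\pi(s))\big)\ge 0,
\]

so that, under the learned dynamics model \(\hat f\), trajectories starting in the certified region \(\{s: h(s)\ge 0\}\) never enter an unsafe state.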

References

Showing 1-10 of 47 references

Constrained Policy Optimization

TLDR
Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration, and allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training.
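The constrained trust-region update that CPO approximately solves at each iteration can be written (in standard notation, slightly simplified) as

\[
\pi_{k+1}=\arg\max_{\pi}\ \mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi}\big[A^{\pi_k}_R(s,a)\big]
\quad\text{s.t.}\quad
J_C(\pi_k)+\tfrac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi}\big[A^{\pi_k}_C(s,a)\big]\le d,
\qquad \bar D_{\mathrm{KL}}(\pi\,\|\,\pi_k)\le\delta,
\]

where \(A_R\) and \(A_C\) are reward and cost advantages; in practice the objective and constraints are approximated locally and the resulting problem is solved via its dual.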

Batch Policy Learning under Constraints

TLDR
A new and simple method for off-policy policy evaluation (OPE) with PAC-style bounds is proposed; it achieves strong empirical results in different domains, including the challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving.

Lyapunov-based Safe Policy Optimization for Continuous Control

TLDR
Safe policy optimization algorithms based on a Lyapunov approach are presented for continuous-action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations.

Projection-Based Constrained Policy Optimization

TLDR
This paper proposes a new algorithm, Projection-Based Constrained Policy Optimization (PCPO), an iterative method for optimizing policies in a two-step process: the first step performs an unconstrained update, while the second step reconciles the constraint violation by projecting the policy back onto the constraint set.
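The two steps can be sketched (standard notation, simplified) as

\[
\text{Step 1 (reward improvement):}\quad
\pi^{k+\frac12}=\arg\max_{\pi}\ \mathbb{E}\big[A^{\pi_k}_R(s,a)\big]
\;\;\text{s.t.}\;\; \bar D_{\mathrm{KL}}(\pi\,\|\,\pi_k)\le\delta,
\]
\[
\text{Step 2 (projection):}\quad
\pi^{k+1}=\arg\min_{\pi}\ D\big(\pi,\pi^{k+\frac12}\big)
\;\;\text{s.t.}\;\; J_C(\pi_k)+\tfrac{1}{1-\gamma}\,\mathbb{E}\big[A^{\pi_k}_C(s,a)\big]\le d,
\]

where the distance \(D\) is taken to be either a KL divergence or an \(L^2\) distance.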

Safe Exploration in Continuous Action Spaces

TLDR
This work addresses the problem of deploying a reinforcement learning agent on a physical system, such as a datacenter cooling unit or a robot, where critical constraints must never be violated, and directly adds to the policy a safety layer that analytically solves an action-correction formulation for each state.
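The per-state correction such a safety layer solves can be sketched (notation assumed, simplified) as

\[
a^*=\arg\min_{a'}\ \tfrac12\lVert a'-a\rVert^2
\quad\text{s.t.}\quad \bar c_i(s)+g_i(s)^{\top}a'\le C_i\ \ \forall i,
\]

where \(\bar c_i\) and \(g_i\) come from a learned linear model of the immediate constraint costs; when at most one constraint is active, this admits the closed-form correction \(a^*=a-\lambda^* g_i(s)\) with \(\lambda^*=\max\!\big(0,\ \frac{g_i(s)^{\top}a+\bar c_i(s)-C_i}{g_i(s)^{\top}g_i(s)}\big)\).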

Reward Constrained Policy Optimization

TLDR
This work presents a novel multi-timescale approach for constrained policy optimization, called Reward Constrained Policy Optimization (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.
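A minimal sketch of the penalty-signal idea follows; the penalty form and learning rate are illustrative assumptions, not the paper's exact implementation.

# Hedged sketch of a reward-constrained penalty in the spirit of RCPO.
def penalized_reward(reward, cost, lam):
    # The penalized reward guides the policy toward constraint satisfaction.
    return reward - lam * cost

def update_multiplier(lam, avg_episode_cost, cost_limit, lr=5e-4):
    # Projected gradient ascent on the Lagrange multiplier (kept non-negative),
    # typically run on its own, slower timescale than the policy update.
    return max(0.0, lam + lr * (avg_episode_cost - cost_limit))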

On-Policy Robot Imitation Learning from a Converging Supervisor

TLDR
Experiments suggest that when this framework is applied with the state-of-the-art deep model-based RL algorithm PETS as an improving supervisor, it outperforms deep RL baselines on continuous control tasks and provides up to an 80-fold speedup in policy evaluation.

Responsive Safety in Reinforcement Learning by PID Lagrangian Methods

TLDR
This work proposes a novel Lagrange multiplier update method that utilizes derivatives of the constraint function, and introduces a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost.
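A minimal sketch of a PID-controlled multiplier update in this spirit; the gains, cost limit, and clamping below are illustrative assumptions, not the paper's exact controller.

# Hedged sketch: PID control of the Lagrange multiplier from episodic cost.
class PIDLagrangian:
    def __init__(self, kp=0.1, ki=0.01, kd=0.05, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, episode_cost):
        error = episode_cost - self.cost_limit              # constraint violation
        self.integral = max(0.0, self.integral + error)     # classic Lagrangian (I) term
        derivative = max(0.0, episode_cost - self.prev_cost)  # damps overshoot
        self.prev_cost = episode_cost
        # Non-negative multiplier used to penalize the policy objective.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)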

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

TLDR
A general and model-free approach for reinforcement learning on real robots with sparse rewards, built upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations; it outperforms DDPG and does not require engineered rewards.

Reinforcement Learning from Imperfect Demonstrations

TLDR
This work proposes a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data and making NAC robust to suboptimal demonstration data.