Corpus ID: 235417492

Policy Gradient Bayesian Robust Optimization for Imitation Learning

Zaynah Javed, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca D. Dragan, Ken Goldberg
The difficulty of specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, many different reward functions often explain the same human feedback, leaving agents uncertain about the true reward function. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy…

RASR: Risk-Averse Soft-Robust MDPs with EVaR and Entropic Risk

Prior work on safe Reinforcement Learning (RL) has studied risk-aversion to randomness in dynamics (aleatory) and to model uncertainty (epistemic) in isolation. We propose and analyze a new framework

A Study of Causal Confusion in Preference-Based Reward Learning

Presents evidence that learning rewards from pairwise trajectory preferences is highly sensitive and non-robust to spurious features and to increasing model capacity, but less sensitive to the type of training data.

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Proposes an anomaly detection task for aberrant policies, offers several baseline detectors, and identifies phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.

On the convex formulations of robust Markov decision processes

This work gives the first convex optimization formulations of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions, and, using entropic regularization and an exponential change of variables, derives a convex formulation with a linear number of variables and constraints but large coefficients in the constraints.

A General Framework for Quantifying Aleatoric and Epistemic Uncertainty in Graph Neural Networks

This work considers the problem of quantifying the uncertainty in GNN predictions stemming from modeling errors and measurement uncertainty, and proposes an approach that treats both sources of uncertainty in a Bayesian framework, using Assumed Density Filtering to quantify aleatoric uncertainty and Monte Carlo dropout to capture uncertainty in model parameters.

Bayesian Robust Optimization for Imitation Learning

BROIL leverages Bayesian reward function inference and a user-specified risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk, and it outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms.
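The trade-off BROIL strikes can be illustrated with a minimal sketch of conditional value at risk (CVaR). The function names and the simple blended objective below are hypothetical illustrations of the risk measure, not the paper's implementation, which optimizes CVaR over a posterior distribution of reward functions:

```python
import numpy as np

def cvar(returns, alpha):
    """Conditional value at risk: mean of the worst alpha-fraction of returns.
    Illustrative helper, not BROIL's actual optimization."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

def broil_objective(returns, lam, alpha):
    """Blend expected return and CVaR with weight lam in [0, 1];
    lam = 1 recovers the risk-neutral objective."""
    return lam * np.mean(returns) + (1 - lam) * cvar(returns, alpha)
```

Sweeping `lam` from 1 toward 0 interpolates between risk-neutral and increasingly risk-averse behavior, which is the knob the user-specified risk tolerance controls.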

Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Bayesian Reward Extrapolation (Bayesian REX) is proposed, a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference.
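Once trajectory features are pretrained, the quantity Bayesian REX can evaluate cheaply during inference is a pairwise-preference likelihood under a linear reward. A minimal Bradley-Terry sketch, with illustrative names and inputs rather than the paper's API:

```python
import numpy as np

def preference_loglik(w, feat_a, feat_b, prefs):
    """Bradley-Terry log-likelihood of pairwise preferences under a linear
    reward w applied to precomputed trajectory features.
    feat_a, feat_b: (n_pairs, d) feature sums of each trajectory in a pair.
    prefs[i] = 1 if trajectory a was preferred, else 0."""
    logits = feat_a @ w - feat_b @ w  # reward difference per pair
    # log sigmoid(x) = x - log(1 + e^x); log(1 - sigmoid(x)) = -log(1 + e^x)
    return np.sum(prefs * logits - np.logaddexp(0.0, logits))
```

Because the likelihood depends on demonstrations only through the low-dimensional features, posterior sampling over `w` (e.g. MCMC) is fast, which is the source of the paper's scalability.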

Constrained Policy Optimization

Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees of near-constraint satisfaction at each iteration; it allows training neural network policies for high-dimensional control while making guarantees about policy behavior throughout training.

Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost the authors demonstrate in MuJoCo domains.

Apprenticeship learning via inverse reinforcement learning

This work models the expert as maximizing a reward function expressible as a linear combination of known features, and gives an algorithm that learns the demonstrated task by using inverse reinforcement learning to recover the unknown reward function.
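The linear-reward assumption above is typically operationalized through discounted feature expectations, which the algorithm matches between learner and expert. A minimal sketch of the empirical estimate, assuming a user-supplied feature map `phi` (a hypothetical helper, not the authors' code):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical discounted feature expectations mu = E[sum_t gamma^t phi(s_t)],
    averaged over demonstrated trajectories. Under a linear reward
    R(s) = w . phi(s), a policy's expected return is w . mu, so matching mu
    to the expert's guarantees matching performance for any such w."""
    mus = []
    for traj in trajectories:
        mu = sum((gamma ** t) * np.asarray(phi(s), dtype=float)
                 for t, s in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)
```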

Worst Cases Policy Gradients

This work proposes an actor-critic framework that models the uncertainty of the future, simultaneously learns a policy based on that uncertainty model, and optimizes policies for varying levels of conditional value-at-risk.

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

This work proposes a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the alpha-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function.
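The alpha-worst-case bound described above can be sketched as an empirical quantile over posterior reward samples; the inputs and names below are hypothetical, assuming per-sample expected returns have already been computed:

```python
import numpy as np

def alpha_worst_case_gap(eval_returns, opt_returns, alpha):
    """Given, for each reward sampled from the Bayesian IRL posterior, the
    expected return of an evaluation policy (eval_returns) and of the
    optimal policy under that reward (opt_returns), return the empirical
    (1 - alpha)-quantile of the performance gap -- a sketch of the
    high-confidence upper bound on the alpha-worst-case regret."""
    gaps = np.asarray(opt_returns, dtype=float) - np.asarray(eval_returns, dtype=float)
    return np.quantile(gaps, 1 - alpha)
```

Smaller `alpha` looks further into the tail of the posterior gap distribution, yielding a more conservative bound.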

Risk-sensitive Inverse Reinforcement Learning via Coherent Risk Models

Comparisons of the Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL framework more accurately captures observed participant behavior, both qualitatively and quantitatively.

Learning a Prior over Intent via Meta-Inverse Reinforcement Learning

This work exploits the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a "prior" that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations.

Efficient Reductions for Imitation Learning

This work proposes two alternative algorithms for imitation learning in which training occurs over several episodes of interaction, and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and to play Mario Bros.