Publications
More Robust Doubly Robust Off-policy Evaluation
TLDR
This paper proposes alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameters by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal.
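For reference, a minimal sketch of the standard doubly robust estimator on a logged contextual-bandit dataset; all inputs (target and behavior propensities, and the fitted reward model q_hat) are assumed precomputed and are not taken from the paper's code. MRDR changes only how q_hat is trained: its parameters are chosen to minimize the empirical variance of this estimate rather than a regression loss on the rewards.

```python
import numpy as np

def dr_estimate(pi_probs, mu_probs, actions, rewards, q_hat):
    """Doubly robust (DR) off-policy value estimate for logged bandit data.

    pi_probs : (n, K) target-policy probabilities pi(a | x_i)
    mu_probs : (n,)   behavior-policy probabilities mu(a_i | x_i) of the logged actions
    actions  : (n,)   logged action indices a_i
    rewards  : (n,)   logged rewards r_i
    q_hat    : (n, K) reward-model predictions Q_hat(x_i, a)
    """
    n = len(actions)
    idx = np.arange(n)
    direct = (pi_probs * q_hat).sum(axis=1)              # E_{a~pi}[Q_hat(x_i, a)]
    rho = pi_probs[idx, actions] / mu_probs              # importance weights
    correction = rho * (rewards - q_hat[idx, actions])   # IPS correction of model bias
    return float(np.mean(direct + correction))
```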
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
TLDR
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
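Once the stationary distribution corrections are learned, they are used in a plain weighted average over the logged data. A minimal sketch, assuming the correction ratios w (which DualDICE actually learns; that step is omitted here) are given:

```python
import numpy as np

def dice_value_estimate(w, rewards):
    """Given learned corrections w(s, a) ~= d^pi(s, a) / d^D(s, a) on a logged
    dataset, estimate the normalized (per-step) value of the target policy pi
    as a weighted average of the logged rewards."""
    w = np.asarray(w, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return float(np.mean(w * rewards))
```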
Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
TLDR
This paper derives a formula for computing the gradient of the Lagrangian function for percentile risk-constrained Markov decision processes and devises policy gradient and actor-critic algorithms that estimate this gradient, update the policy in the descent direction, and update the Lagrange multiplier in the ascent direction.
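Using the Rockafellar-Uryasev representation of CVaR, the constrained problem and its Lagrangian are roughly of the form below (notation is illustrative, not copied from the paper; here alpha is the tail-probability parameter). The algorithms descend in (theta, nu) and ascend in lambda:

```latex
\min_{\theta}\ \mathbb{E}\big[C^{\theta}\big]
\quad\text{s.t.}\quad \mathrm{CVaR}_{\alpha}\big(D^{\theta}\big)\le \beta,
\qquad
L(\theta,\nu,\lambda)=\mathbb{E}\big[C^{\theta}\big]
+\lambda\Big(\nu+\tfrac{1}{\alpha}\,\mathbb{E}\big[(D^{\theta}-\nu)^{+}\big]-\beta\Big).
```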
AlgaeDICE: Policy Gradient from Arbitrary Experience
TLDR
A new formulation of max-return optimization that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic, off-policy data distribution, and shows that, if the auxiliary dual variables of the objective are optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.
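The on-policy quantity being recovered is the standard policy gradient, written here in common notation as a reminder of what the off-policy objective's gradient is shown to equal (the derivation itself is in the paper):

```latex
\nabla_{\theta} J(\pi_{\theta})
= \mathbb{E}_{(s,a)\sim d^{\pi_{\theta}}}\!\big[\, Q^{\pi_{\theta}}(s,a)\,
\nabla_{\theta}\log \pi_{\theta}(a\mid s) \,\big].
```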
Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach
TLDR
This paper shows that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors for a given error budget, and presents an approximate value-iteration algorithm for CVaR MDPs and analyzes its convergence rate.
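As a reminder of the objective itself (a standard estimator, not code from the paper), the empirical CVaR of a batch of sampled costs can be computed with the Rockafellar-Uryasev formula; here alpha is the tail probability, so small alpha focuses on the worst outcomes:

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Empirical CVaR at tail level alpha: the mean cost in the worst
    alpha-fraction of outcomes, via
    CVaR_alpha(Z) = min_nu { nu + E[(Z - nu)^+] / alpha }
    evaluated at the empirical (1 - alpha)-quantile."""
    costs = np.asarray(costs, dtype=float)
    nu = np.quantile(costs, 1.0 - alpha)                  # empirical VaR
    return nu + np.mean(np.maximum(costs - nu, 0.0)) / alpha
```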
A Lyapunov-based Approach to Safe Reinforcement Learning
TLDR
This work defines and presents a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints.
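The constraints referred to are linear in the policy's action distribution at each state, which is what makes them easy to impose during a policy update. A hypothetical, minimal check of one such constraint; the transition model P_s, Lyapunov values L, and per-state slack eps_s are assumed given and are not the paper's construction:

```python
import numpy as np

def lyapunov_constraint_ok(pi_s, P_s, L, s, eps_s=0.0):
    """Check the local, linear Lyapunov constraint at state s:
    E_{a ~ pi_s, s' ~ P(.|s,a)}[L(s')] <= L(s) + eps_s.

    pi_s  : (A,) candidate action distribution at s (the decision variable)
    P_s   : (A, S) next-state probabilities P(s' | s, a)
    L     : (S,) candidate Lyapunov function values
    eps_s : scalar slack budget at s
    """
    expected_next_L = pi_s @ (P_s @ L)   # linear in pi_s
    return expected_next_L <= L[s] + eps_s
```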
Algorithms for CVaR Optimization in MDPs
TLDR
This paper first derives a formula for computing the gradient of this risk-sensitive (CVaR) objective function, then devises policy gradient and actor-critic algorithms that each use a specific method to estimate this gradient and update the policy parameters in the descent direction.
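With the same Rockafellar-Uryasev representation as above (alpha again the tail probability; notation illustrative and possibly differing from the paper's conventions), the objective and the likelihood-ratio gradients that sampling-based algorithms of this kind estimate are roughly:

```latex
\min_{\theta,\nu}\ \nu+\tfrac{1}{\alpha}\,\mathbb{E}\big[(D^{\theta}-\nu)^{+}\big],
\qquad
\partial_{\nu} \;=\; 1-\tfrac{1}{\alpha}\,\mathbb{P}\big(D^{\theta}\ge\nu\big),
\qquad
\nabla_{\theta} \;=\; \tfrac{1}{\alpha}\,
\mathbb{E}\big[\nabla_{\theta}\log \mathbb{P}_{\theta}(\xi)\,(D^{\theta}-\nu)^{+}\big],
```

where $\xi$ denotes a sampled trajectory and $D^{\theta}$ its cumulative cost.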
Lyapunov-based Safe Policy Optimization for Continuous Control
TLDR
This paper presents safe policy optimization algorithms based on a Lyapunov approach for continuous-action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations.
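One way such algorithms keep every executed action safe is to project the policy's proposed action onto a linearized safety constraint before acting. A minimal sketch under that assumption; the constraint coefficients (g, c) would come from a learned constraint/Lyapunov model, which is omitted here:

```python
import numpy as np

def project_action(a, g, c):
    """Project a proposed action onto the half-space {a : g.a + c <= 0}
    (closed-form Euclidean projection onto a single linear constraint)."""
    a = np.asarray(a, dtype=float)
    g = np.asarray(g, dtype=float)
    violation = g @ a + c
    if violation <= 0:
        return a                                # already satisfies the constraint
    return a - (violation / (g @ g)) * g        # shift along g just enough to be safe
```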
Safe Policy Improvement by Minimizing Robust Baseline Regret
TLDR
This paper develops and analyzes a new model-based approach to compute a safe policy when the authors have access to an inaccurate dynamics model of the system with known accuracy guarantees, and uses this model to directly minimize the (negative) regret w.r.t. the baseline policy.
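The optimization the summary describes is, roughly, a max-min over the set of dynamics models consistent with the known accuracy guarantees, comparing candidate policies against the baseline (notation illustrative, not taken verbatim from the paper):

```latex
\hat{\pi} \in \arg\max_{\pi}\ \min_{\xi \in \Xi}\
\big(\rho(\pi,\xi) - \rho(\pi_{B},\xi)\big),
```

where $\Xi$ is the set of plausible models, $\pi_{B}$ the baseline policy, and $\rho(\pi,\xi)$ the expected return of $\pi$ under model $\xi$; a nonnegative max-min value then certifies that the returned policy does not underperform the baseline under any plausible model.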
Weighted SGD for ℓp Regression with Randomized Preconditioning
TLDR
This paper proposes a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and for constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system; the method inherits faster convergence rates that depend only on the lower dimension of the linear system while maintaining low computational complexity.
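A minimal sketch of the two-stage idea for the least-squares case (p = 2), assuming a Gaussian sketch for the preconditioner; the actual algorithm also covers general ℓp regression, other sketching transforms, and specific step-size and sampling choices not reproduced here:

```python
import numpy as np

def pwsgd_l2(A, b, sketch_size=200, n_iters=2000, step=1.0, seed=0):
    """Randomized preconditioning + weighted SGD for min_x ||A x - b||_2.
    sketch_size should be at least the number of columns of A."""
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # 1) Randomized preconditioning: QR of a sketched matrix S @ A gives R
    #    such that A @ inv(R) is approximately orthonormal.
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    _, R = np.linalg.qr(S @ A)
    A_pre = np.linalg.solve(R.T, A.T).T          # = A @ inv(R)

    # 2) Importance sampling distribution from preconditioned row norms.
    p = np.linalg.norm(A_pre, axis=1) ** 2
    p = p / p.sum()

    # 3) Weighted SGD on the preconditioned system (unbiased gradient of
    #    0.5 * ||A_pre y - b||^2 / n thanks to the 1/p[i] reweighting).
    y = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.choice(n, p=p)
        resid = A_pre[i] @ y - b[i]
        grad = (resid / p[i]) * A_pre[i] / n
        y -= (step / np.sqrt(t)) * grad

    return np.linalg.solve(R, y)                 # map back: x = inv(R) @ y
```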