Corpus ID: 119317451

Robust Policy Optimization with Baseline Guarantees

@article{Chow2015RobustPO,
  title={Robust Policy Optimization with Baseline Guarantees},
  author={Yinlam Chow and Marek Petrik and Mohammad Ghavamzadeh},
  journal={arXiv: Optimization and Control},
  year={2015}
}
Our goal is to compute a policy that guarantees improved return over a baseline policy even when the available MDP model is inaccurate. The inaccurate model may be constructed, for example, by system identification techniques when the true model is inaccessible. When the modeling error is large, the standard solution to the constructed model has no performance guarantees with respect to the true model. In this paper we develop algorithms that provide such performance guarantees and show a trade… 
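One natural way to formalize the guarantee the abstract describes (a sketch with assumed notation, not the paper's exact formulation): write $\pi_B$ for the baseline policy, $\Xi$ for a set of plausible MDP models built from the data, and $\rho(\pi, \xi)$ for the return of policy $\pi$ in model $\xi$. The robust improvement problem is then

\[
  \pi_R \in \arg\max_{\pi} \; \min_{\xi \in \Xi} \bigl( \rho(\pi, \xi) - \rho(\pi_B, \xi) \bigr),
\]

so if the true model lies in $\Xi$ and the optimal value is non-negative, $\pi_R$ is guaranteed to return at least as much as $\pi_B$ on the true model.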

Citing papers

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
Proposes two bootstrapping off-policy evaluation methods that use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data.
Shaping Control Variates for Off-Policy Evaluation
We study the problem of off-policy evaluation in RL settings where reward signals are sparse. We introduce a new model-based control variate for variance reduction in off-policy evaluation inspired …
POPCORN: Partially Observed Prediction COnstrained ReiNforcement Learning
Introduces a new optimization objective that produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and does so in the batch off-policy settings typical of healthcare, where only retrospective data is available.
Bridging the Gap Between Simulation and Reality
This thesis aims to bridge the gap between simulation and reality by developing methods for grounding simulation to reality and for assessing how well a policy learned in simulation will perform before it is executed in the real world.
Bridging the Gap Between Simulation and Reality (Doctoral Consortium)
Combining Parametric and Nonparametric Models for Off-Policy Evaluation
We consider a model-based approach to perform batch off-policy evaluation in reinforcement learning. Our method takes a mixture-of-experts approach to combine parametric and non-parametric models of …
High Confidence Off-Policy Evaluation with Models
Proposes two bootstrapping approaches, combined with learned MDP transition models, to efficiently estimate lower confidence bounds on policy performance from limited data in both continuous and discrete state spaces, and derives a theoretical upper bound on the model bias.

References

Showing 1–10 of 11 references.
Robust Dynamic Programming
G. Iyengar. Math. Oper. Res., 2005.
Proves that when the set of transition measures has a certain "rectangularity" property, all of the main results for finite- and infinite-horizon dynamic programming extend to natural robust counterparts; a schematic version of the resulting recursion is sketched below.
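As a hedged illustration of what rectangularity buys (notation assumed here, not taken from the reference): with an $(s,a)$-rectangular family of transition sets $\mathcal{P}(s,a)$ and discount factor $\gamma$, the robust value function satisfies

\[
  v(s) = \max_{a} \Bigl[ r(s,a) + \gamma \min_{p \in \mathcal{P}(s,a)} \sum_{s'} p(s')\, v(s') \Bigr],
\]

so the adversarial model choice decomposes independently across state-action pairs and robust value and policy iteration remain tractable.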
Regret based Robust Solutions for Uncertain Markov Decision Processes
Provides algorithms that employ sampling to improve across multiple dimensions, and compares them against benchmark algorithms on two domains from the literature to demonstrate their empirical effectiveness.
Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems
Proposes a model that describes uncertainty in both the distribution form (discrete, Gaussian, exponential, etc.) and the moments (mean and covariance matrix), and demonstrates that for a wide range of cost functions the associated distributionally robust stochastic program can be solved efficiently.
Action Elimination and Stopping Conditions for Reinforcement Learning
Proposes model-based and model-free variants of the action-elimination method, derives stopping conditions that guarantee the learned policy is approximately optimal with high probability, and demonstrates a considerable speedup and added robustness.
Constrained Markov Decision Processes
Chapter overview: Introduction; Examples of Constrained Dynamic Control Problems; On Solution Approaches for CMDPs with Expected Costs; Other Types of CMDPs; Cost Criteria and Assumptions; The Convex Analytical Approach.
Envelope Theorems for Arbitrary Choice Sets
The standard envelope theorems apply to choice sets with convex and topological structure, providing sufficient conditions for the value function to be differentiable in a parameter and …
Markov Decision Processes: Discrete Stochastic Dynamic Programming
M. Puterman. Wiley Series in Probability and Statistics, 1994.
Covers recent research advances in areas such as countable state space models with the average reward criterion, constrained models, and models with risk-sensitive optimality criteria, and explores several topics that have received little or no attention in other books.
Markov decision processes with uncertain transition rates: sensitivity and robust control
Considers Markov decision problems with uncertain transition rates represented as compact sets and develops solution techniques for obtaining the max-min optimal policy, which maximizes the worst-case average per-unit-time reward.
Nonlinear Programming
Robust Markov Decision Processes
Considers robust MDPs that offer probabilistic guarantees with respect to the unknown parameters, countering the detrimental effects of estimation errors by determining a policy that attains the highest worst-case performance over a confidence region of plausible parameters.