Corpus ID: 12494117

Safe Policy Improvement by Minimizing Robust Baseline Regret

@inproceedings{Ghavamzadeh2016SafePI,
  title={Safe Policy Improvement by Minimizing Robust Baseline Regret},
  author={Mohammad Ghavamzadeh and Marek Petrik and Yinlam Chow},
  booktitle={NIPS},
  year={2016}
}
An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy …
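
A minimal formal sketch of that objective, under notation assumed here rather than taken from the paper: let \rho(\pi, \xi) denote the expected return of a policy \pi in a model \xi, \pi_B the baseline policy, and \Xi the set of models consistent with the estimated dynamics and its accuracy guarantee. The robust baseline-regret formulation then reads

  % hedged sketch; symbols \rho, \pi_B, \Xi are assumptions, not verbatim from the paper
  \pi_S \in \arg\max_{\pi \in \Pi} \; \min_{\xi \in \Xi} \bigl( \rho(\pi, \xi) - \rho(\pi_B, \xi) \bigr)

Since \pi_B itself is a feasible choice of \pi, the optimal value is non-negative, so whenever the true model lies in \Xi the returned policy is guaranteed not to underperform the baseline.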

Citations

Safe Policy Improvement with Baseline Bootstrapping
TLDR
This paper adopts the safe policy improvement (SPI) approach, inspired by the knows-what-it-knows paradigm, and develops two computationally efficient bootstrapping algorithms, one value-based and one policy-based, both accompanied by theoretical SPI bounds (an illustrative sketch of the bootstrapping idea appears after this list).
Soft Safe Policy Improvement with Baseline Bootstrapping
TLDR
The method takes just enough risk to try uncertain actions while remaining safe in practice, and is therefore less conservative than state-of-the-art methods.
Safe Policy Learning from Observations
TLDR
A stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), safely improves the average behavior; its primary advantages are stability in the presence of value estimation errors and the elimination of a policy search process.
Safe Policy Improvement Approaches on Discrete Markov Decision Processes
TLDR
A taxonomy of SPI algorithms is introduced, and an interesting property of two classes of SPI algorithms is shown empirically: while algorithms that incorporate the uncertainty as a penalty on the action-value achieve higher mean performance, actively restricting the set of policies produces good policies more consistently and is thus safer.
Safe Policy Improvement with Soft Baseline Bootstrapping
TLDR
This work improves on the SPI with Baseline Bootstrapping (SPIBB) algorithm by allowing the policy search over a wider set of policies and adopts a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty.
Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
TLDR
A novel method based on Safe Policy Improvement with Baseline Bootstrapping (SPIBB; Laroche et al., 2019) is proposed that provides high-probability guarantees on the performance of the agent in the true environment.
Safe Policy Improvement with Baseline Bootstrapping in Factored Environments
TLDR
A novel safe reinforcement learning algorithm is presented that exploits the factored dynamics of the environment to become less conservative, improving the policy with potentially an order of magnitude fewer samples than the flat algorithm.
OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation
TLDR
This paper presents an offline RL algorithm, OptiDICE, that directly estimates the stationary distribution corrections of the optimal policy and, unlike previous offline RL algorithms, does not rely on policy gradients.
Soft-Robust Algorithms for Handling Model Misspecification
TLDR
The soft-robust criterion is introduced, its fundamental properties are established, optimizing it is shown to be NP-hard, and two algorithms for optimizing it approximately are proposed and analyzed.
Safe Policy Improvement with an Estimated Baseline Policy
TLDR
This paper applies SPIBB algorithms with a baseline estimated from the data and shows safe policy improvement guarantees over the true baseline even without direct access to it; the approach significantly outperforms competing algorithms both in safe policy improvement and in average performance.
...
...
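
Several of the works above build on the SPIBB idea of keeping the improved policy equal to the baseline on state-action pairs that the batch covers poorly. As a rough illustration, the Python sketch below performs one greedy improvement step in that spirit; the function name, array layout, and count threshold n_wedge are assumptions made here for illustration, not the published algorithm.

import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    # One bootstrapped policy-improvement step in the spirit of SPIBB (illustrative sketch only).
    # q:       (S, A) array of action values estimated from the batch
    # pi_b:    (S, A) baseline policy probabilities
    # counts:  (S, A) state-action visit counts in the batch
    # n_wedge: count threshold below which a state-action pair is treated as uncertain
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    uncertain = counts < n_wedge
    for s in range(n_states):
        # Keep the baseline probabilities on uncertain (bootstrapped) actions.
        pi[s, uncertain[s]] = pi_b[s, uncertain[s]]
        trusted = np.flatnonzero(~uncertain[s])
        if trusted.size == 0:
            # No well-estimated action in this state: fall back to the baseline entirely.
            pi[s] = pi_b[s]
            continue
        # Give all remaining probability mass to the best well-estimated action.
        best = trusted[np.argmax(q[s, trusted])]
        pi[s, best] += 1.0 - pi[s].sum()
    return pi

A full algorithm would alternate a step like this with policy evaluation under the model estimated from the batch, as in standard policy iteration.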

References

Showing 1-10 of 15 references
Regret based Robust Solutions for Uncertain Markov Decision Processes
TLDR
This paper provides algorithms that employ sampling to improve along multiple dimensions and compares them against benchmark algorithms on two domains from the literature to demonstrate their empirical effectiveness.
Robust Dynamic Programming
  • G. Iyengar, Math. Oper. Res., 2005
TLDR
It is proved that when this set of measures has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts.
Safe Policy Iteration
TLDR
Two safe policy-iteration algorithms that differ in the way the next policy is chosen w.r.t. the current policy are proposed and compared with state-of-the-art approaches on some chain-walk domains and on the Blackjack card game.
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning.
Off-policy Model-based Learning under Unknown Factored Dynamics
TLDR
The G-SCOPE algorithm is introduced, which evaluates a new policy based on data generated by the existing policy; it is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.
Parametric regret in uncertain Markov decision processes
  • Huan Xu, Shie Mannor, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference, 2009
TLDR
It is shown that the problem of computing a minimax regret strategy is NP-hard in general; under favorable conditions, however, such a strategy can be computed numerically in an efficient way, and algorithms are proposed to find it.
Robust Markov Decision Processes
TLDR
This work considers robust MDPs that offer probabilistic guarantees in view of the unknown parameters, to counter the detrimental effects of estimation errors, and determines a policy that attains the highest worst-case performance over a confidence region for the unknown parameters.
RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning
We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness serves to reduce the sensitivity to the approximation error of …
High-Confidence Off-Policy Evaluation
TLDR
This paper proposes an off-policy method for computing a lower confidence bound on the expected return of a policy and provides confidence levels regarding the accuracy of its estimates.
Robust Control of Markov Decision Processes with Uncertain Transition Matrices
TLDR
This work considers a robust control problem for a finite-state, finite-action Markov decision process, where uncertainty in the transition matrices is described in terms of possibly nonconvex sets; it shows that perfect duality holds for this problem and that it can be solved with a variant of the classical dynamic programming algorithm, the "robust dynamic programming" algorithm.
...
...