# Robust Batch Policy Learning in Markov Decision Processes

```bibtex
@article{Qi2020RobustBP,
  title   = {Robust Batch Policy Learning in Markov Decision Processes},
  author  = {Zhengling Qi and Peng Liao},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2011.04185}
}
```

We study the sequential decision-making problem in a Markov decision process (MDP) in which each policy is evaluated by a set of average rewards over different horizon lengths and different initial distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that maximizes the smallest value in this set. Leveraging semi-parametric efficiency theory from statistics, we…
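The max-min objective in the abstract can be illustrated with a toy sketch (this is not the paper's algorithm, and the value table below is made-up data): given per-criterion value estimates for each candidate policy, the robust choice is the policy whose worst criterion value is largest.

```python
# values[policy][criterion] = estimated average reward of that policy under one
# horizon/initial-distribution pair (hypothetical illustrative numbers).
values = {
    "pi_1": [0.80, 0.55, 0.70],
    "pi_2": [0.65, 0.64, 0.66],
    "pi_3": [0.90, 0.40, 0.85],
}

def robust_policy(values):
    """Return the policy maximizing the worst-case (smallest) criterion value."""
    return max(values, key=lambda pi: min(values[pi]))

best = robust_policy(values)
print(best, min(values[best]))  # -> pi_2 0.64
```

Note that `pi_3` has the highest best-case value but is rejected: the robust criterion trades peak performance for a guarantee across all evaluation settings.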

## 3 Citations

### Online Policy Optimization for Robust MDP

- Computer Science, ArXiv
- 2022

This work considers online robust MDPs, learned by interacting with an unknown nominal system, and proposes a robust optimistic policy optimization algorithm that is provably efficient under a more realistic online setting.

### Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

- Computer Science, Journal of the American Statistical Association
- 2022

It is shown that, with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process, and an efficient off-policy value estimator is developed that is robust to potential model misspecification and provides rigorous uncertainty quantification.

### Towards Robust Off-Policy Evaluation via Human Inputs

- Computer Science, AIES
- 2022

This work proposes a novel framework, Robust OPE (ROPE), which considers shifts on a subset of covariates in the data based on user inputs and estimates worst-case utility under these shifts, and develops computationally efficient algorithms that are robust to these shifts for contextual bandits and Markov decision processes.

## References

Showing 1-10 of 74 references

### Batch Policy Learning in Average Reward Markov Decision Processes

- Computer Science
- 2020

This work proposes a doubly robust estimator for the average reward of a batch policy learning problem in the infinite horizon Markov Decision Process and develops an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class.
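The doubly robust idea behind this estimator can be sketched in the simpler one-step (contextual bandit) setting; this is an illustration of the generic DR construction only, not the paper's average-reward estimator. The estimate combines a direct outcome-model term with an importance-weighted residual correction, so it is unbiased if either the outcome model or the behavior probabilities are correct.

```python
def doubly_robust_value(data, pi_e, q_hat):
    """Doubly robust off-policy value estimate in a one-step (bandit) setting.

    data:  list of (action, reward, behavior_prob) tuples.
    pi_e:  dict action -> probability under the evaluation policy.
    q_hat: dict action -> estimated mean reward (the outcome model).
    """
    total = 0.0
    for a, r, b_prob in data:
        direct = sum(pi_e[ap] * q_hat[ap] for ap in pi_e)   # model-based term
        correction = (pi_e[a] / b_prob) * (r - q_hat[a])    # IS residual term
        total += direct + correction
    return total / len(data)

# Two actions with true mean rewards 0.2 and 0.8, logged by a uniform behavior
# policy; the outcome model q_hat is deliberately wrong (all zeros), yet the
# importance-weighted correction still recovers the true value of pi_e.
data = [(0, 0.2, 0.5), (1, 0.8, 0.5)]
print(doubly_robust_value(data, {0: 0.0, 1: 1.0}, {0: 0.0, 1: 0.0}))  # -> 0.8
```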

### Actor-Critic Algorithms for Risk-Sensitive MDPs

- Computer Science, NIPS
- 2013

This paper considers both discounted and average-reward Markov decision processes, devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, and establishes the convergence of the algorithms to locally risk-sensitive optimal policies.

### Mean-Variance Optimization in Markov Decision Processes

- Computer Science, ICML
- 2011

It is proved that computing a policy that maximizes the mean reward under a variance constraint is NP-hard in some cases, and strongly NP-hard in others.

### Infinite-Horizon Policy-Gradient Estimation

- Computer Science, J. Artif. Intell. Res.
- 2001

GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
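The GPOMDP estimate can be sketched on a toy problem (a two-armed bandit rather than a POMDP, purely for illustration): maintain a discounted eligibility trace of score functions and average its product with the observed rewards.

```python
import math
import random

def gpomdp_gradient(theta, beta=0.9, T=5000, seed=0):
    """GPOMDP-style biased estimate of the average-reward gradient w.r.t. theta.

    Toy setting (not from the paper): a one-parameter Bernoulli policy with
    pi(a=1) = sigmoid(theta) on a two-armed bandit where arm 1 pays reward 1
    and arm 0 pays reward 0. beta discounts the eligibility trace z.
    """
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))      # probability of choosing arm 1
    z, grad = 0.0, 0.0
    for t in range(T):
        a = 1 if rng.random() < p1 else 0
        score = (1.0 - p1) if a == 1 else -p1  # d/dtheta log pi(a)
        r = 1.0 if a == 1 else 0.0
        z = beta * z + score                   # discounted eligibility trace
        grad += (r * z - grad) / (t + 1)       # running average of r_t * z_t
    return grad

# The estimate should be positive: increasing theta favors the better arm.
print(gpomdp_gradient(theta=0.0) > 0)  # -> True
```

The discount `beta` controls the bias-variance trade-off noted in the abstract: smaller `beta` truncates long-range credit assignment (more bias), while `beta` near 1 increases the variance of the trace.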

### Variance-Penalized Markov Decision Processes

- Mathematics, Math. Oper. Res.
- 1989

We consider a Markov decision process with both the expected limiting-average and the discounted total-return criteria, appropriately modified to include a penalty for the variability in the stream…

### Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

- Computer Science, J. Mach. Learn. Res.
- 2020

This work considers for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless, and develops a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which is termed double reinforcement learning (DRL).

### Distributionally Robust Markov Decision Processes

- Mathematics, Computer Science, Math. Oper. Res.
- 2010

It is shown that finding the optimal distributionally robust strategy can be reduced to the standard robust MDP where parameters are known to belong to a single uncertainty set; hence, it can be computed in polynomial time under mild technical conditions.

### Statistical inference of the value function for reinforcement learning in infinite-horizon settings

- Computer Science, Journal of the Royal Statistical Society: Series B (Statistical Methodology)
- 2021

This paper constructs confidence intervals (CIs) for a policy's value in infinite-horizon settings where the number of decision points diverges to infinity, and shows that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.

### Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes

- Economics, ArXiv
- 2019

A new estimator based on Double Reinforcement Learning (DRL) is introduced that leverages the MDP structure for OPE; it remains efficient when both nuisances (the $q$-function and the marginalized density ratio) are estimated at slow, nonparametric rates, and remains consistent when either one is estimated consistently.

### Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

- Computer Science, ICML
- 2020

An easily computable confidence bound for the policy evaluator is provided, which may be useful for optimistic planning and safe policy improvement; the work also establishes a finite-sample instance-dependent error upper bound and a nearly matching minimax lower bound.