Corpus ID: 226282299

Robust Batch Policy Learning in Markov Decision Processes

Zhengling Qi and Peng Liao
We study the sequential decision-making problem in a Markov decision process (MDP) where each policy is evaluated by a set of average rewards computed over different horizon lengths and under different initial distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that maximizes the smallest value in this set. Leveraging semiparametric efficiency theory from statistics, we… 
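The max-min objective described above can be sketched concretely: each candidate policy is scored by a *set* of values (here, Monte-Carlo average rewards over several horizon lengths and initial distributions), and the robust policy is the one whose worst-case value is largest. The two-state MDP, the candidate policy class, and all names below are hypothetical illustrations, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: P[a][s, s'] transitions, R[s, a] rewards.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])

def avg_reward(policy, horizon, init, n_traj=200):
    """Monte-Carlo estimate of a deterministic policy's average reward."""
    total = 0.0
    for _ in range(n_traj):
        s = rng.choice(2, p=init)
        for _ in range(horizon):
            a = policy[s]
            total += R[s, a]
            s = rng.choice(2, p=P[a][s])
    return total / (n_traj * horizon)

# The "set" of evaluation criteria: (horizon length, initial distribution).
criteria = [(5, np.array([1.0, 0.0])),
            (50, np.array([0.5, 0.5]))]

policies = [(0, 0), (0, 1), (1, 0), (1, 1)]  # tiny deterministic policy class
# Robust objective: pick the policy whose smallest value in the set is largest.
best = max(policies, key=lambda pi: min(avg_reward(pi, h, d) for h, d in criteria))
print(best)
```

In the paper's setting the values would be estimated from the batch data via efficient off-policy estimators rather than by simulation; the sketch only shows the shape of the max-min criterion.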

Online Policy Optimization for Robust MDP

This work considers online robust MDPs, learned by interacting with an unknown nominal system, and proposes a robust optimistic policy optimization algorithm that is provably efficient in this more realistic online setting.

Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

It is shown that, with auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process; an efficient off-policy value estimator is then developed that is robust to potential model misspecification and provides rigorous uncertainty quantification.

Towards Robust Off-Policy Evaluation via Human Inputs

This work proposes a novel framework, Robust OPE (ROPE), which considers shifts on a user-specified subset of covariates in the data and estimates worst-case utility under these shifts; computationally efficient algorithms robust to such shifts are developed for contextual bandits and Markov decision processes.

Batch Policy Learning in Average Reward Markov Decision Processes

This work proposes a doubly robust estimator of the average reward for batch policy learning in infinite-horizon Markov decision processes and develops an optimization algorithm to compute the optimal policy within a parameterized stochastic policy class.

Actor-Critic Algorithms for Risk-Sensitive MDPs

This paper considers both discounted and average-reward Markov decision processes and devises actor-critic algorithms that estimate the gradient and update the policy parameters in the ascent direction, establishing convergence of the algorithms to locally risk-sensitive optimal policies.

Mean-Variance Optimization in Markov Decision Processes

It is proved that computing a policy that maximizes the mean reward under a variance constraint is NP-hard in some cases and strongly NP-hard in others.

Infinite-Horizon Policy-Gradient Estimation

GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
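The core of the GPOMDP-style estimate can be sketched in a few lines: a discounted eligibility trace of score functions is correlated with the observed rewards to approximate the average-reward gradient. The one-parameter Bernoulli policy and the reward scheme below are hypothetical toy choices (effectively a bandit, the degenerate case of a POMDP), not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0                                  # policy parameter

def grad_log_pi(a, theta):
    """Score of a Bernoulli policy with P(a=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return (1.0 - p1) if a == 1 else -p1     # d/dtheta log pi(a)

beta, T = 0.9, 5000                          # trace discount, horizon
z, delta = 0.0, 0.0                          # eligibility trace, gradient estimate
for t in range(T):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    a = int(rng.random() < p1)               # sample action from the policy
    r = 1.0 if a == 1 else 0.2               # action 1 pays more on average
    z = beta * z + grad_log_pi(a, theta)     # discounted score trace
    delta += (r * z - delta) / (t + 1)       # running average of r_t * z_t

print(delta)  # positive: increasing theta raises the average reward
```

The trace discount `beta` trades bias for variance, which is the source of the bias mentioned in the summary above.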

Variance-Penalized Markov Decision Processes

We consider a Markov decision process under both the expected limiting-average and the discounted total-return criteria, appropriately modified to include a penalty for variability in the stream of rewards.

Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

This work considers, for the first time, the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless, and develops a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, termed double reinforcement learning (DRL).
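The doubly robust structure behind DRL-flavored estimators can be sketched in miniature: a $q$-function plug-in plus an importance-weighted TD-residual correction. The single-state MDP below is a hypothetical toy (so the marginalized state density ratio is identically 1), and the helper name `drl_estimate` is mine, not the paper's; the sketch only illustrates why exact density ratios repair a biased $q$-function.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
r_a = np.array([0.0, 1.0])           # deterministic reward of each action
pi_e = np.array([0.2, 0.8])          # target (evaluation) policy
pi_b = np.array([0.5, 0.5])          # behavior policy that generated the data

# Exact target-policy quantities for this one-state MDP:
v_true = (pi_e @ r_a) / (1.0 - gamma)        # V = E_pi[r] / (1 - gamma)
q = r_a + gamma * v_true                     # Q(a) = r(a) + gamma * V

# Batch of n one-step transitions from the behavior policy.
n = 1000
a = rng.choice(2, size=n, p=pi_b)
r = r_a[a]
rho = pi_e[a] / pi_b[a]                      # action density ratio pi_e / pi_b

def drl_estimate(q_hat):
    """Plug-in value plus importance-weighted TD-residual correction."""
    v_hat = pi_e @ q_hat                     # plug-in value under q_hat
    resid = r + gamma * v_hat - q_hat[a]     # TD residual on each transition
    return v_hat + np.mean(rho * resid) / (1.0 - gamma)

print(drl_estimate(q))          # exact q: residuals vanish, estimate = v_true
print(drl_estimate(q + 1.0))    # biased q: the weighted correction removes
                                # most of the unit bias
```

With the exact $q$ the residuals are identically zero, so the estimate equals the true value; with a constant bias added to $q$, the correction term (using the exact ratios) cancels the bias up to sampling noise, which is the double robustness the summary describes.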

Distributionally Robust Markov Decision Processes

It is shown that finding the optimal distributionally robust strategy can be reduced to the standard robust MDP where parameters are known to belong to a single uncertainty set; hence, it can be computed in polynomial time under mild technical conditions.

Statistical inference of the value function for reinforcement learning in infinite‐horizon settings

The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite-horizon settings where the number of decision points diverges to infinity; it is shown that the proposed CI achieves nominal coverage even when the optimal policy is not unique.

Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes

A new estimator based on Double Reinforcement Learning (DRL) leverages the Markovian structure of the problem for OPE; it is efficient when both nuisance functions (the $q$-function and the marginalized density ratio) are estimated at slow, nonparametric rates, and remains consistent when either one is estimated consistently.

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

An easily computable confidence bound for the policy evaluator is provided, which may be useful for optimistic planning and safe policy improvement, and establishes a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound.