Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Miao Lu, Yifei Min, Zhaoran Wang, Zhuoran Yang
We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy that possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O…


Strategic Decision-Making in the Presence of Information Asymmetry: Provably Efficient RL with Algorithmic Instruments

A novel algorithm, pessimistic policy learning with algorithmic instruments (PLAN), is proposed, which leverages instrumental variable regression and the pessimism principle to learn a near-optimal principal's policy under general function approximation.

Blessing from Experts: Super Reinforcement Learning in Confounded Environments

To address the issue of unmeasured confounding, a number of nonparametric identification results are established, two super-policy learning algorithms are developed, and their corresponding finite-sample regret guarantees are derived.

Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games

A pessimistic model-based algorithm with Bernstein-style lower confidence bounds is proposed that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than C⋆_clipped S(A+B) / ((1−γ)³ε²) (up to some log factor).

Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

A novel identification result is developed and used to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game, and it is proved that, under mild assumptions such as partial coverage of the offline data, the policy pair obtained through the method converges to the optimal one at a satisfactory rate.

Statistical Estimation of Confounded Linear MDPs: An Instrumental Variable Approach

In a Markov decision process (MDP), unobservable confounders may exist and affect the data-generating process, so that classic off-policy evaluation (OPE) estimators may fail to identify the policy value.

Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

This work considers off-policy evaluation in a partially observed MDP (POMDP): estimating the value of a given target policy from trajectories with only partial state observations, generated by a different and unknown behavior policy that may depend on the unobserved state.

Is Pessimism Provably Efficient for Offline RL?

A pessimistic variant of the value iteration algorithm (PEVI) is proposed, which incorporates an uncertainty quantifier as a penalty function, and a data-dependent upper bound on the suboptimality of PEVI is established for general Markov decision processes (MDPs).
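To make the pessimism principle concrete, the following is a minimal tabular sketch of value iteration with a lower-confidence-bound penalty. It is illustrative only, not the paper's exact PEVI algorithm: the penalty scale `beta`, the 1/√N(s,a) bonus shape, and the uniform fallback for unvisited state-action pairs are all simplifying assumptions.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, beta=1.0):
    """Tabular pessimism sketch (illustrative, not the paper's exact PEVI).

    counts[s, a, s'] : empirical transition counts from the offline dataset
    rewards[s, a]    : known mean rewards (assumed for simplicity)
    H                : horizon
    beta             : scale of the pessimism penalty (assumed constant)
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                 # visit counts N(s, a)
    # Empirical transition model; unvisited (s, a) fall back to uniform.
    P_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1),
                     1.0 / S)
    # Penalty shrinks like 1/sqrt(N(s, a)); large where data is scarce.
    bonus = beta / np.sqrt(np.maximum(n_sa, 1))
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = rewards + P_hat @ V - bonus       # penalized (pessimistic) Q
        Q = np.clip(Q, 0.0, H - h)            # keep values in a valid range
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

The key effect: a poorly covered action carries a large penalty, so the learned policy avoids state-action pairs the offline dataset cannot certify, which is exactly what enables suboptimality bounds under partial coverage.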

A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes

This work proposes novel identification methods for OPE in POMDPs with latent confounders, introducing bridge functions that link the target policy's value to the observed data distribution, together with minimax estimation methods for learning these bridge functions.

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

It is shown how, given only a latent variable model for states and actions, policy value can be identified from off-policy data, and optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions.

Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning

A robust approach is developed that estimates sharp bounds on the (unidentifiable) value of a given policy in an infinite-horizon problem given data from another policy with unobserved confounding, subject to a sensitivity model.

When Is Partially Observable Reinforcement Learning Not Scary?

It is proved that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity.

Minimax-Optimal Policy Learning Under Unobserved Confounding

It is demonstrated that hidden confounding can hinder existing policy-learning approaches and lead to unwarranted harm, whereas the robust approach guarantees safety and focuses on well-evidenced improvement, a necessity for making personalized treatment policies learned from observational data reliable in practice.

Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage

It is demonstrated that this algorithmic framework can be applied to many specialized Markov Decision Processes where the additional structural assumptions can further refine the concept of partial coverage.

Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

A new offline RL framework based on single-policy concentrability is presented, which smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL.

Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

This work presents a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing the results from prior works.