Corpus ID: 219965619

Provably Efficient Causal Reinforcement Learning with Confounded Observational Data

Lingxiao Wang, Zhuoran Yang, Zhaoran Wang
Empowered by expressive function approximators such as neural networks, deep reinforcement learning (DRL) has achieved tremendous empirical success. However, learning expressive function approximators requires collecting a large dataset (interventional data) by interacting with the environment. This lack of sample efficiency prohibits the application of DRL in critical scenarios, e.g., autonomous driving and personalized medicine, since trial and error in the online setting is often unsafe and…


Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

This work considers off-policy evaluation in a partially observed MDP (POMDP): estimating the value of a given target policy from trajectories with only partial state observations, generated by a different and unknown behavior policy that may depend on the unobserved state.

On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning

This work considers the problem of using expert data with unobserved confounders for imitation and reinforcement learning, and proposes a sampling procedure that addresses the unknown shift and proves convergence to an optimal solution.

Efficient Reinforcement Learning with Prior Causal Knowledge

The causal upper confidence bound value iteration (C-UCBVI) algorithm is proposed, which exploits the causal structure in C-MDPs and improves on standard reinforcement learning algorithms that do not take causal knowledge into account; a causal factored UCBVI algorithm further reduces the regret exponentially in terms of S.

A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

We consider off-policy evaluation (OPE) in partially observable Markov decision processes (POMDPs), where the evaluation policy depends only on observable variables while the behavior policy may additionally depend on unobserved latent variables.

A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes

This work proposes novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution, and proposes minimax estimation methods for learning these bridge functions.

Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework

A reinforcement learning framework is introduced for carrying out A/B testing in these experiments while characterizing the long-term treatment effects, and the theoretical properties of the testing procedure are systematically investigated.

A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning

It is empirically shown that the Ẑ estimated by this method can significantly reduce dynamics prediction errors and improve the performance of model-based RL methods in zero-shot new environments with unseen dynamics.

Reinforcement Learning of Causal Variables Using Mediation Analysis

A parsimonious causal graph is obtained in which interventions occur at the level of policies; the causal variables and policies are determined by maximizing a new optimization target inspired by mediation analysis, which differs from the expected return.

Dynamic Bottleneck for Robust Self-Supervised Exploration

A Dynamic Bottleneck (DB) model is proposed that learns a dynamics-relevant representation based on the information-bottleneck principle and encourages the agent to explore state-action pairs with high information gain; it outperforms several state-of-the-art exploration methods in noisy environments.

Statistical Estimation of Confounded Linear MDPs: An Instrumental Variable Approach

In a Markov decision process (MDP), unobservable confounders may exist and affect the data-generating process, so that classic off-policy evaluation (OPE) estimators may fail to identify the true value of a target policy.

Deconfounding Reinforcement Learning in Observational Settings

This work considers the problem of learning good policies solely from historical data in which unobserved factors affect both observed actions and rewards; it is the first time that confounders are taken into consideration for addressing full RL problems with observational data.

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

It is shown how, given only a latent variable model for states and actions, the policy value can be identified from off-policy data, and how optimal balancing can be combined with the learned importance ratios to obtain the policy value while avoiding direct modeling of reward functions.

Counterfactual Data-Fusion for Online Reinforcement Learners

This work provides a recipe for combining multiple datasets to accelerate learning in a variant of the Multi-Armed Bandit problem with Unobserved Confounders (MABUC) and demonstrates its efficacy with extensive simulations.

Environment Reconstruction with Hidden Confounders for Reinforcement Learning based Recommendation

DEMER adopts a multi-agent generative adversarial imitation learning framework: it introduces a confounder-embedded policy, uses a compatible discriminator for training the policies, and derives a recommendation policy with significantly improved performance in the test phase of the real application.

Causal Confusion in Imitation Learning

It is shown that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and the proposed solution to combat it through targeted interventions to determine the correct causal model is validated.

Is Q-learning Provably Efficient?

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler and more flexible than model-based approaches; this paper shows that Q-learning with a UCB exploration bonus achieves regret $\tilde{O}(\sqrt{H^3 S A T})$ in episodic MDPs.
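The idea behind Q-learning with a UCB bonus can be sketched in a few lines: optimistically initialize Q, use a step-size that decays as $(H+1)/(H+t)$, and add a count-based bonus of order $\sqrt{H^3/t}$. The following is a minimal illustrative sketch on a toy tabular episodic MDP, not the paper's exact algorithm; the function name, MDP construction, and constant `c` are assumptions for illustration.

```python
import numpy as np

def q_learning_ucb(P, R, H, S, A, K, c=1.0, seed=0):
    """Episodic tabular Q-learning with a UCB-style exploration bonus
    (in the spirit of "Is Q-learning Provably Efficient?").
    P[h, s, a] : next-state distribution; R[h, s, a] : reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    Q = np.full((H, S, A), float(H))         # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)       # visit counts
    total_reward = 0.0
    for k in range(K):
        s = 0                                # fixed initial state
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # greedy w.r.t. optimistic Q
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)        # rate from the analysis
            bonus = c * np.sqrt(H**3 * np.log(S * A * H * K) / t)
            s_next = rng.choice(S, p=P[h, s, a])
            r = R[h, s, a]
            total_reward += r
            v_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] = min(H, (1 - alpha) * Q[h, s, a]
                             + alpha * (r + v_next + bonus))
            s = s_next
    return Q, total_reward
```

The bonus shrinks as a state-action pair is visited more often, so exploration is directed at poorly understood pairs without any explicit model of the transition dynamics.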

Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes

This paper develops the first adaptive algorithm that achieves near-optimal regret in DTRs in the online setting, without any access to historical data, as well as a novel RL algorithm that efficiently learns the optimal DTR while leveraging abundant yet imperfect confounded observations.

Confounding-Robust Policy Improvement

It is demonstrated that hidden confounding can hinder existing policy learning approaches and lead to unwarranted harm, while the robust approach guarantees safety and focuses on well-evidenced improvement, a necessity for making personalized treatment policies learned from observational data reliable in practice.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data without additional online data collection.

Provably Efficient Reinforcement Learning with Linear Function Approximation

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
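The optimistic LSVI scheme can be illustrated compactly: at each episode, solve a regularized least-squares regression per step h against the next-step optimistic values, add an elliptical bonus $\beta\sqrt{\phi^\top\Lambda_h^{-1}\phi}$, and act greedily. The sketch below assumes a toy episodic MDP with stationary transitions and a given feature map; the function name, toy construction, and constants `beta`, `lam` are illustrative, not from the paper.

```python
import numpy as np

def lsvi_ucb(phi, P, R, H, K, beta=1.0, lam=1.0, seed=0):
    """Optimistic Least-Squares Value Iteration (LSVI-UCB-style sketch).
    phi[s, a] : d-dimensional feature vector; P[s, a] : next-state
    distribution; R[s, a] : reward. Illustrative only."""
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    data = [[] for _ in range(H)]            # (s, a, r, s') tuples per step h
    total_reward = 0.0
    for k in range(K):
        # Backward pass: regularized least squares + elliptical bonus.
        w = np.zeros((H, d))
        Lam_inv = np.stack([np.eye(d) / lam for _ in range(H)])
        Q = np.zeros((H, S, A))
        for h in reversed(range(H)):
            if data[h]:
                Phi = np.array([phi[s, a] for s, a, _, _ in data[h]])
                Lam_inv[h] = np.linalg.inv(lam * np.eye(d) + Phi.T @ Phi)
                v_next = Q[h + 1].max(axis=1) if h + 1 < H else np.zeros(S)
                y = np.array([r + v_next[sn] for _, _, r, sn in data[h]])
                w[h] = Lam_inv[h] @ Phi.T @ y
            bonus = beta * np.sqrt(np.einsum('sad,de,sae->sa',
                                             phi, Lam_inv[h], phi))
            Q[h] = np.minimum(H, phi @ w[h] + bonus)
        # Forward pass: act greedily w.r.t. the optimistic Q.
        s = 0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next = rng.choice(S, p=P[s, a])
            r = R[s, a]
            total_reward += r
            data[h].append((s, a, r, s_next))
            s = s_next
    return Q, total_reward
```

Because both the regression and the bonus depend only on the d-dimensional features, the per-episode computation and the resulting regret scale with d rather than with the number of states and actions.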