Corpus ID: 235658155

Causal Reinforcement Learning using Observational and Interventional Data

Maxime Gasse, Damien Grasset, Guillaume Gaudron, Pierre-Yves Oudeyer
Efficiently learning a causal model of the environment is a key challenge for model-based RL agents operating in POMDPs. We consider a scenario in which the learning agent can collect online experiences through direct interactions with the environment (interventional data), but also has access to a large collection of offline experiences obtained by observing another agent interact with the environment (observational data). A key ingredient that makes this situation non…
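The distinction between observational and interventional data recurs throughout the works listed below. As a minimal, purely illustrative sketch (not the paper's method), the following hypothetical two-armed bandit with an unobserved confounder U shows why the two data sources differ: naively estimating E[R | a] from confounded observational logs understates the interventional effect E[R | do(a)]. All dynamics and probabilities here are invented for illustration.

```python
import random

random.seed(0)

# Illustrative sketch only (not the surveyed algorithms): a two-armed
# bandit with a hidden confounder U that influences both the behaviour
# policy's action choice and the reward.

def reward(u, a):
    # Bernoulli reward; under do(a), arm 1 is better on average.
    p = 0.2 + 0.4 * a + 0.3 * u
    return 1 if random.random() < p else 0

def observational_sample():
    # The logging agent sees U and (here) deterministically picks
    # a = 1 - u, confounding action with reward.
    u = random.randint(0, 1)
    a = 1 - u
    return a, reward(u, a)

def interventional_sample(a):
    # do(a): the action is set independently of U.
    u = random.randint(0, 1)
    return reward(u, a)

def mean(xs):
    return sum(xs) / len(xs)

n = 20_000
obs = [observational_sample() for _ in range(n)]
obs_effect = (mean([r for a, r in obs if a == 1])
              - mean([r for a, r in obs if a == 0]))
int_effect = (mean([interventional_sample(1) for _ in range(n)])
              - mean([interventional_sample(0) for _ in range(n)]))

# Observational data understates arm 1's advantage, because the
# behaviour policy picks arm 0 exactly when U favours high reward.
print(f"observational E[R|a=1] - E[R|a=0]:     {obs_effect:.2f}")
print(f"interventional E[R|do(1)] - E[R|do(0)]: {int_effect:.2f}")
```

Under these assumed dynamics, interventional sampling recovers arm 1's true advantage while the confounded observational estimate does not; combining the two data sources soundly is exactly the challenge the works below address.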
Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes
This work considers off-policy evaluation in a partially observed MDP (POMDP): estimating the value of a given target policy from trajectories with only partial state observations, generated by a different and unknown behaviour policy that may depend on the unobserved state.
A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions
This survey provides a taxonomy of current DRL-based recommender systems, summarizes existing methods, discusses emerging topics and open issues, and offers perspectives on advancing the domain.
Causal Multi-Agent Reinforcement Learning: Review and Open Problems
It is argued that causality can offer improved safety, interpretability, and robustness, while also providing strong theoretical guarantees for emergent behaviour.
Deconfounding Reinforcement Learning in Observational Settings
This work considers the problem of learning good policies solely from historical data in which unobserved factors affect both observed actions and rewards, and is the first to take confounders into account when addressing full RL problems with observational data.
Reinforcement Learning and Causal Models
This chapter reviews the diverse roles that causal knowledge plays in reinforcement learning. The first half of the chapter contrasts a "model-free" system that learns to repeat actions that lead to…
Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders
It is shown how, given only a latent variable model for states and actions, policy value can be identified from off-policy data, and how optimal balancing can be combined with such learned ratios to estimate policy value while avoiding direct modeling of reward functions.
Causal Confusion in Imitation Learning
It is shown that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and the proposed solution to combat it through targeted interventions to determine the correct causal model is validated.
Transfer Learning in Multi-Armed Bandit: A Causal Approach
This work tackles the problem of transferring knowledge across bandit agents in settings where causal effects cannot be identified by Pearl's do-calculus nor by standard off-policy learning techniques, and proposes a new identification strategy, B-kl-UCB.
Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach
It is shown that, if the causal diagram of the underlying environment is provided, one can achieve regret exponentially smaller than D_{X∪S}, and two online algorithms are developed that satisfy such regret bounds by exploiting the causal structure underlying the DTR.
Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
This paper develops the first adaptive algorithm that achieves near-optimal regret in DTRs in online settings without any access to historical data, as well as a novel RL algorithm that efficiently learns the optimal DTR while leveraging abundant yet confounded observations.
Batch Reinforcement Learning
This chapter introduces the basic principles and theory behind batch reinforcement learning, presents the most important algorithms, discusses ongoing research within this field, and briefly surveys real-world applications of batch reinforcement learning.
Off-Policy Evaluation in Partially Observable Environments
A model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP, is formulated; it is shown how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs.
Bandits with Unobserved Confounders: A Causal Approach
It is shown that, to achieve low regret in certain realistic classes of bandit problems (namely, in the face of unobserved confounders), the rational agent requires both experimental and observational quantities.