# Information-Theoretic Considerations in Batch Reinforcement Learning

@article{Chen2019InformationTheoreticCI, title={Information-Theoretic Considerations in Batch Reinforcement Learning}, author={Jinglin Chen and Nan Jiang}, journal={ArXiv}, year={2019}, volume={abs/1905.00360} }

Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity ("why do we need them?") and the naturalness ("when do they hold?") of such assumptions have largely eluded the literature. In this paper, we revisit these…

## 205 Citations

### Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

- Computer Science
- UAI
- 2020

Performance guarantees are proved for two algorithms that approximate Q* in batch reinforcement learning; one of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation.

### Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

- Computer Science, Mathematics
- COLT
- 2022

This work proves that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy, and highlights a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods.

### Batch Value-function Approximation with Only Realizability

- Computer Science
- ICML
- 2021

The algorithm, BVFT, breaks the hardness conjecture via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions.

### Provably Good Batch Reinforcement Learning Without Great Exploration

- Computer Science
- 2020

The necessity of the pessimistic update and the limitations of previous algorithms and analyses are highlighted by illustrative MDP examples, and an empirical comparison of the algorithm with other state-of-the-art batch RL baselines on standard benchmarks is presented.

### Q⋆ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

- Computer Science
- 2020

Performance guarantees are proved for two algorithms that approximate Q⋆ in batch reinforcement learning, which use a novel and explicit importance-weighting correction to overcome the infamous “double sampling” difficulty in Bellman error estimation.

### Provably Good Batch Reinforcement Learning Without Great Exploration

- Computer Science
- ArXiv
- 2020

It is shown that a small modification to the Bellman optimality and evaluation backups, making the update more conservative, can yield much stronger guarantees on the performance of the output policy; in certain settings, the resulting algorithms can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.

### Offline Reinforcement Learning Under Value and Density-Ratio Realizability: the Power of Gaps

- Computer Science
- UAI
- 2022

This work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation, and it provides guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling.

### Offline Reinforcement Learning with Realizability and Single-policy Concentrability

- Mathematics, Computer Science
- COLT
- 2022

A simple algorithm is proposed based on the primal-dual formulation of MDPs, where the dual variables are modeled using a density-ratio function against offline data; it enjoys polynomial sample complexity under only realizability and single-policy concentrability.

### What are the Statistical Limits of Offline RL with Linear Function Approximation?

- Computer Science
- ICLR
- 2021

The results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

## References

### An upper bound on the loss from approximate optimal-value functions

- Computer Science
- Machine Learning
- 2004

An upper bound on performance loss is derived that is slightly tighter than that in Bertsekas (1987), and the extension of the bound to Q-learning is shown to provide a partial theoretical rationale for the approximation of value functions.

### Regularization in reinforcement learning

- Computer Science
- 2011

It is proved that the regularization-based Approximate Value/Policy Iteration algorithms introduced in this thesis enjoy an oracle-like property and may be used to achieve adaptivity: their performance is almost as good as that obtained with the unknown best parameters.

### PAC Reinforcement Learning with Rich Observations

- Computer Science
- NIPS
- 2016

A new model for reinforcement learning with rich observations is proposed, generalizing contextual bandits to sequential decision making, and it is proved that the algorithm learns near-optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space.

### On Oracle-Efficient PAC RL with Rich Observations

- Computer Science
- NeurIPS
- 2018

With stochastic hidden state dynamics, it is proved that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model, and new provably sample-efficient algorithms are presented.

### Abstraction Selection in Model-based Reinforcement Learning

- Computer Science
- ICML
- 2015

This paper proposes a simple algorithm based on statistical hypothesis testing that comes with a finite-sample guarantee under assumptions on candidate abstractions, resulting in a loss bound that depends only on the quality of the best available abstraction and is polynomial in planning horizon.

### On the sample complexity of reinforcement learning.

- Computer Science
- 2003

Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.

### Contextual Decision Processes with low Bellman rank are PAC-Learnable

- Computer Science
- ICML
- 2017

A complexity measure, the Bellman rank, is presented that enables tractable learning of near-optimal behavior in CDPs and is naturally small for many well-studied RL models and provides new insights into efficient exploration for RL with function approximation.

### A Theoretical Analysis of Deep Q-Learning

- Computer Science
- L4DC
- 2020

This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives and proposes the Minimax-DQN algorithm for the zero-sum Markov game with two players.

### Stable Function Approximation in Dynamic Programming

- Computer Science, Mathematics
- ICML
- 1995

### Deep Reinforcement Learning and the Deadly Triad

- Computer Science
- ArXiv
- 2018

This work investigates the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models (deep Q-networks trained with experience replay), analysing how the components of this system play a role in the emergence of the deadly triad and in the agent's performance.