Corpus ID: 244477645

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

@inproceedings{Foster2022OfflineRL,
  title={Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation},
  author={Dylan J. Foster and Akshay Krishnamurthy and David Simchi-Levi and Yunzong Xu},
  booktitle={Conference on Learning Theory (COLT)},
  year={2022}
}
We consider the offline reinforcement learning problem, where the aim is to learn a decision-making policy from logged data. Offline RL, particularly when coupled with (value) function approximation to allow for generalization in large or continuous state spaces, is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well suited to safety-critical domains. Existing sample complexity guarantees for offline value function…
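
For context, here is a minimal statement of the offline RL setup the abstract refers to; the notation (dataset D, data distribution μ, function class F) is illustrative rather than taken from the paper.

```latex
% Offline RL setup (illustrative notation). The learner receives a logged
% dataset and cannot interact with the environment:
\[
  D = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{n}, \qquad (s_i, a_i) \sim \mu ,
\]
% and must output a policy \(\hat{\pi}\) that nearly maximizes the expected return
\[
  J(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big],
  \qquad
  J(\hat{\pi}) \ge \max_{\pi} J(\pi) - \varepsilon ,
\]
% using a value-function class \(\mathcal{F}\) to approximate \(Q^{\pi}\)
% (generalization in large or continuous state spaces).
```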

Citations of this paper

Offline Reinforcement Learning Under Value and Density-Ratio Realizability: the Power of Gaps

This work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation, and provides guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling.

Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient

This work shows that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning algorithm; its results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration-style design.
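
For intuition about what a "pessimistic fitted Q" method looks like, here is a minimal sketch, not the cited paper's algorithm: finite-horizon fitted Q-iteration with a linear regressor, where a user-supplied uncertainty `penalty` is subtracted before the bootstrap max. The dataset layout and the `penalty` callable are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def pessimistic_fqi(dataset, action_space, horizon, penalty):
    """Sketch of pessimistic fitted Q-iteration (illustrative, not the paper's method).

    dataset[h]   : list of (phi_sa, reward, phi_next_by_action) tuples at step h,
                   where phi_sa is the feature vector of (s, a) and
                   phi_next_by_action[a] is the feature vector of (s', a).
    penalty(phi) : uncertainty bonus subtracted from bootstrapped values
                   (e.g., based on how well the features are covered by the data).
    Returns one linear Q-model per step; the greedy policy is implied.
    """
    q_models = [None] * horizon
    for h in reversed(range(horizon)):
        X = np.array([phi for phi, _, _ in dataset[h]])
        r = np.array([rew for _, rew, _ in dataset[h]])
        if h == horizon - 1:
            y = r  # no bootstrapping at the last step
        else:
            next_model = q_models[h + 1]
            # Pessimistic next-state value: max_a [ Q_{h+1}(s', a) - penalty(s', a) ].
            v_next = np.array([
                max(next_model.predict(phi_next[a].reshape(1, -1))[0]
                    - penalty(phi_next[a])
                    for a in action_space)
                for _, _, phi_next in dataset[h]
            ])
            y = r + v_next
        q_models[h] = Ridge(alpha=1.0).fit(X, y)  # regression step of FQI
    return q_models
```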

Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

This work proposes variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs) and provides improved offline learning bounds over the best-known existing results.

Offline Reinforcement Learning with Realizability and Single-policy Concentrability

This work gives a simple algorithm based on the primal-dual formulation of MDPs, in which the dual variables are modeled using a density-ratio function against the offline data; the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability.
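
For reference, the primal-dual formulation alluded to here is the linear-programming dual of an MDP, which optimizes over discounted occupancy measures; modeling the dual variables as density ratios against the data distribution follows from the identity below (illustrative notation, normalization conventions vary).

```latex
% Dual LP over discounted occupancy measures d(s,a):
\[
  \max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
  \quad \text{s.t.} \quad
  \sum_{a} d(s,a) = (1-\gamma)\,\rho_0(s)
  + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s', a') \;\; \forall s .
\]
% In the offline setting one models the ratio w(s,a) = d^{\pi}(s,a) / \mu(s,a)
% against the data distribution \mu, so that
\[
  J(\pi) = \tfrac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim \mu}\big[ w(s,a)\, r(s,a) \big].
\]
```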

Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian

This paper leverages the marginalized importance sampling (MIS) formulation of RL and presents the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification.

A Sharp Characterization of Linear Estimators for Offline Policy Evaluation

Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a different (behavior) policy.
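
Stated formally (with illustrative notation), the problem this summary describes is:

```latex
% Offline policy evaluation (OPE): given a dataset
%   D = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{n}
% collected under a behavior policy \pi_b, estimate the value of a target policy \pi_e,
\[
  J(\pi_e) = \mathbb{E}_{\pi_e}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big],
\]
% without any further interaction with the environment.
```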

Learning Bellman Complete Representations for Offline Policy Evaluation

This work proposes BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage, and shows that this representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR).

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

These error bounds establish that off-policy reinforcement learning remains statistically viable even in the absence of Bellman completeness, and characterize the intermediate situation between the favorable Bellman-complete setting and the worst-case scenario where exponential lower bounds are in force.
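
For readers comparing the assumptions in these summaries, the two conditions contrasted here can be stated as follows (illustrative notation, fixed target policy π):

```latex
% Realizability: the class contains the target value function,
\[
  Q^{\pi} \in \mathcal{F}.
\]
% Bellman completeness: the class is closed under the Bellman operator,
\[
  \mathcal{T}^{\pi} f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F},
  \qquad
  (\mathcal{T}^{\pi} f)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ f(s', \pi(s')) \big].
\]
% Completeness is the much stronger condition (and is non-monotone in \mathcal{F});
% the question studied here is when realizability alone already suffices.
```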

Oracle Inequalities for Model Selection in Offline Reinforcement Learning

This work proposes the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors and concludes with several numerical simulations showing it is capable of reliably selecting a good model class.

Behavior Prior Representation learning for Offline Reinforcement Learning

Theoretically, it is proved that BPR enjoys performance guarantees when integrated into algorithms that either have policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms).

References

Showing 1-10 of 50 references

What are the Statistical Limits of Offline RL with Linear Function Approximation?

The results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
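
A standard way to formalize "low distribution shift" in such statements is a concentrability coefficient; the definition below is one common (illustrative) variant.

```latex
\[
  C^{\pi} \;=\; \sup_{s,a} \frac{d^{\pi}(s,a)}{\mu(s,a)} ,
\]
% the worst-case ratio between the discounted occupancy measure of the policy
% to be evaluated and the offline data distribution \mu; sample complexity
% guarantees typically scale with C^{\pi} or with weaker coverage measures.
```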

Instabilities of Offline RL with Pre-Trained Neural Representation

The methodology explores offline RL with features from pre-trained neural networks, in the hope that these representations are powerful enough to permit sample-efficient offline RL, and finds that offline RL is stable only under extremely mild distribution shift.

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

This work helps formalize the issue known as the deadly triad and explains that the bootstrapping problem is potentially more severe than the extrapolation issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples; online exploration is critical to enable sample-efficient RL with function approximation.

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

A new estimator, MWL, is introduced that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work.
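
To make "importance ratios over the state-action distributions" concrete, here is a minimal sketch of how such a ratio is used for evaluation once learned. The `w_hat` estimate is assumed to be given; MWL itself learns it via a minimax objective that is not reproduced here.

```python
import numpy as np

def ope_with_marginalized_ratios(transitions, w_hat, gamma):
    """Plug-in off-policy value estimate from a learned state-action
    density ratio w_hat(s, a) ~ d^pi(s, a) / mu(s, a).

    transitions: iterable of (s, a, r) samples drawn from the data
                 distribution mu; no behavior-policy probabilities are
                 needed, which is the appeal of marginalized weighting.
    """
    weighted_rewards = [w_hat(s, a) * r for (s, a, r) in transitions]
    # E_{d^pi}[r] = E_{mu}[w * r]; the 1/(1-gamma) factor converts the
    # normalized occupancy average into a discounted return (convention-dependent).
    return np.mean(weighted_rewards) / (1.0 - gamma)
```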

Information-Theoretic Considerations in Batch Reinforcement Learning

This paper revisits two types of assumptions for value-function approximation in batch reinforcement learning, provides theoretical results towards answering the questions they raise, and takes steps towards a deeper understanding of value-function approximation.

Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

This work provides sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation) and highlighting that having a good representation in and of itself is insufficient for efficient reinforcement learning unless the quality of this approximation passes certain hard thresholds.

Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

This paper unifies minimax methods for off-policy evaluation using value functions and marginalized importance weights into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid and its length quantifies the misspecification of the other class.

Off-Policy Deep Reinforcement Learning without Exploration

This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
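
As a concrete (and simplified) illustration of the batch-constrained idea in the discrete-action case, the bootstrap max can be restricted to actions that an estimated behavior model deems sufficiently likely; the threshold rule and the `behavior_probs` estimate below are illustrative rather than the paper's exact architecture.

```python
import numpy as np

def batch_constrained_backup(q_next, behavior_probs, r, gamma, tau=0.3):
    """One Q-learning backup target in the spirit of batch-constrained RL.

    q_next         : array of Q(s', a) estimates, one per discrete action.
    behavior_probs : estimated probabilities p(a | s') under the behavior
                     policy (e.g., from a generative model fit to the batch).
    tau            : relative-likelihood threshold; actions with
                     p(a|s') / max_a p(a|s') < tau are excluded from the max.
    """
    allowed = behavior_probs / behavior_probs.max() >= tau
    # Restricting the max to well-supported actions limits extrapolation
    # error from actions never (or rarely) seen in the batch.
    return r + gamma * np.max(np.where(allowed, q_next, -np.inf))
```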

Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Novel alternative completeness conditions under which OPE is feasible are introduced, and the first finite-sample result with first-order efficiency in non-tabular environments (i.e., with the minimal coefficient in the leading term) is presented.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
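
BEAR enforces its support constraint with a sampled maximum mean discrepancy (MMD) between actions proposed by the learned policy and actions in the batch; below is a minimal sketch of that penalty, with a Gaussian kernel and bandwidth chosen purely for illustration.

```python
import numpy as np

def mmd_penalty(policy_actions, data_actions, sigma=1.0):
    """Sampled squared MMD between policy-proposed and dataset actions,
    usable as a BEAR-style support-matching penalty.

    policy_actions, data_actions: arrays of shape (n, action_dim) and (m, action_dim).
    """
    def gaussian_kernel(x, y):
        # Pairwise Gaussian kernel matrix k(x_i, y_j).
        sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    k_pp = gaussian_kernel(policy_actions, policy_actions).mean()
    k_dd = gaussian_kernel(data_actions, data_actions).mean()
    k_pd = gaussian_kernel(policy_actions, data_actions).mean()
    return k_pp + k_dd - 2.0 * k_pd
```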