The Role of Coverage in Online Reinforcement Learning

Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade
Coverage conditions, which assert that the data logging distribution adequately covers the state space, play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing, somewhat surprisingly, that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability, that…
2 Citations

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

These error bounds establish that off-policy reinforcement learning remains statistically viable even in the absence of Bellman completeness, and they characterize the intermediate situation between the favorable Bellman-complete setting and the worst-case scenario where exponential lower bounds are in force.

Leveraging Offline Data in Online Reinforcement Learning

This work characterizes the number of online samples needed in this setting given access to some offline dataset, and develops an algorithm, FTPedel, which is provably optimal for MDPs with linear structure.



What are the Statistical Limits of Offline RL with Linear Function Approximation?

The results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

This work proves that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy, and highlights a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods.

Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms

A new complexity measure, the Bellman Eluder (BE) dimension, is introduced, and it is proved that both algorithms learn near-optimal policies for low BE dimension problems in a number of samples that is polynomial in all relevant parameters but independent of the size of the state-action space.

Offline Reinforcement Learning with Realizability and Single-policy Concentrability

A simple algorithm based on the primal-dual formulation of MDPs, where the dual variables are modeled using a density-ratio function against offline data, enjoys polynomial sample complexity under only realizability and single-policy concentrability.

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

This work shows that there exists a fundamental tradeoff between achieving low regret and identifying an ε-optimal policy at the instance-optimal rate, and proposes a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP.

Is Pessimism Provably Efficient for Offline RL?

This work proposes a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function, and establishes a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs).

Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches

Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with the algorithmic results, demonstrates an exponential separation between model-based and model-free RL in some rich-observation settings.

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

This work proposes an algorithm, Pedel, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance.

Is Long Horizon RL More Difficult Than Short Horizon RL?

This work answers the title question in the negative, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon, and introduces two ideas, including the construction of an ε-net for near-optimal policies whose log-covering number scales only logarithmically with the horizon.

PAC Reinforcement Learning with Rich Observations

A new model for reinforcement learning with rich observations is proposed, generalizing contextual bandits to sequential decision making, and it is proved that the algorithm learns near-optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space.