Corpus ID: 232290507

Bilinear Classes: A Structural Framework for Provable Generalization in RL

@inproceedings{Du2021BilinearCA,
  title={Bilinear Classes: A Structural Framework for Provable Generalization in RL},
  author={Simon Shaolei Du and Sham M. Kakade and Jason D. Lee and Shachar Lovett and Gaurav Mahajan and Wen Sun and Ruosong Wang},
  booktitle={ICML},
  year={2021}
}
This work introduces Bilinear Classes, a new structural framework, which permit generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear Q∗/V∗ model in which both the optimal Q-function and the optimal V-function are linear in some known feature space. Our main result…
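
To make the new Linear Q∗/V∗ model concrete, the following is a minimal sketch of the assumption with illustrative notation (the feature maps φ, ψ and the parameters w∗, θ∗ below are not quoted from the paper):

  % Linear Q*/V* model (sketch): both optimal value functions are linear
  % in known d-dimensional features; w* and theta* are unknown parameters.
  Q^*(s,a) = \langle \phi(s,a),\, w^* \rangle, \qquad
  V^*(s)   = \langle \psi(s),\, \theta^* \rangle,
  \qquad \phi(s,a),\, \psi(s) \in \mathbb{R}^{d}.

Roughly, and as a hedged recollection of the framework rather than its exact definition, a Bilinear Class further requires that the average Bellman error of a hypothesis factors as an inner product of two low-dimensional embeddings of that hypothesis, which is the structure the paper exploits to obtain polynomial sample complexity.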

Citations

Sample-Efficient Reinforcement Learning for POMDPs with Linear Function Approximations
TLDR
An RL algorithm is proposed that constructs optimistic estimators of undercomplete POMDPs with linear function approximations via reproducing kernel Hilbert space (RKHS) embedding, and it is theoretically proved that the proposed algorithm finds an ε-optimal policy with Õ(1/ε^2) episodes of exploration.
Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure
TLDR
A new algorithm is provided along with statistical guarantees that efficiently exploits low-rank structure given access to a generative model, achieving a sample complexity of Õ(d^5 (|S| + |A|) poly(H) / ε^2) for a rank-d setting, which is minimax optimal with respect to the scaling of |S|, |A|, and ε.
TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions
TLDR
The minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation is considered, and an exponentially large lower bound holds when A = Ω(min(d^(1/4), H^(1/2))), under either (i), (ii), or (iii).
Efficient Local Planning with Linear Function Approximation
TLDR
This work proposes two new algorithms for discounted Markov decision processes with linear function approximation and a simulator that have polynomial query and computational cost in the dimension of the features, the effective planning horizon and the targeted sub-optimality, while the cost remains independent of the size of the state space.
Representation Learning for Online and Offline RL in Low-rank MDPs
TLDR
An algorithm, REP-UCB (Upper Confidence Bound driven REPresentation learning for RL), is proposed, which significantly improves the sample complexity and is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation.
Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
TLDR
A new offline actor-critic algorithm is proposed that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art, and an upper bound on the suboptimality gap of the policy returned by the procedure is proved.
Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms
TLDR
A new complexity measure, the Bellman Eluder (BE) dimension, is introduced, and it is proved that both algorithms learn the near-optimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters, but independent of the size of the state-action space.
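For orientation, a rough restatement of the measure as recalled (the paper's precise definition is stated via the distributional eluder dimension over a family of distributions Π, abbreviated below):

  % Sketch: BE dimension = worst-over-steps eluder dimension of the Bellman residuals.
  % T_h is the step-h Bellman operator; dim_DE is the distributional eluder dimension.
  \dim_{BE}(\mathcal{F}, \Pi, \epsilon) \;=\; \max_{h}\ \dim_{DE}\big(\{\, f_h - \mathcal{T}_h f_{h+1} : f \in \mathcal{F} \,\},\ \Pi_h,\ \epsilon\big).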
Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
TLDR
A new sampling protocol is investigated, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states, and an algorithm is developed that achieves a sample complexity scaling polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not with the size of the state/action space.
...
...

References

SHOWING 1-10 OF 55 REFERENCES
On the Expressivity of Neural Networks for Deep Reinforcement Learning
TLDR
It is shown, theoretically and empirically, that even for a one-dimensional continuous state space, there are many MDPs whose optimal Q-functions and policies are much more complex than the dynamics.
Learning Near Optimal Policies with Low Inherent Bellman Error
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to…
Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
TLDR
This work proposes a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension K and invariant with respect to the size of the state space, and exploits the monotonicity property and intrinsic noise structure of the Bellman operator.
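As recalled (notation illustrative, not quoted from the reference), the linearly additive feature assumption factors the transition model through K known features, which is why the sample size can scale with K rather than with the number of states:

  % Sketch of the linearly additive feature model: phi_k are known feature
  % functions and psi_k are unknown measures over next states.
  P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s,a)\, \psi_k(s').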
Contextual Decision Processes with low Bellman rank are PAC-Learnable
TLDR
A complexity measure, the Bellman rank, is presented that enables tractable learning of near-optimal behavior in CDPs and is naturally small for many well-studied RL models and provides new insights into efficient exploration for RL with function approximation.
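A hedged sketch of the measure, as recalled (notation below is illustrative): Bellman rank at most M means the average Bellman error of a hypothesis f, evaluated under the roll-in policy of another hypothesis f′, factors through M-dimensional embeddings at every step h:

  % Sketch: Bellman rank <= M iff the (f', f) Bellman-error matrix factorizes at each step.
  \mathcal{E}_h(f', f) \;=\; \mathbb{E}\big[\, f(x_h, \pi_f(x_h)) - r_h - f(x_{h+1}, \pi_f(x_{h+1})) \;\big|\; a_{1:h-1} \sim \pi_{f'} \,\big]
  \;=\; \langle \nu_h(f'),\, \xi_h(f) \rangle, \qquad \nu_h(f'),\, \xi_h(f) \in \mathbb{R}^{M}.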
Predictive Representations of State
TLDR
This is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls), and it is shown that any system has a linear predictive state representation with a number of predictions no greater than the number of states in its minimal POMDP model.
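As recalled (notation illustrative), a linear PSR represents state by the predictions of a core set Q of tests, and every other test prediction is a fixed linear function of that vector:

  % Sketch: p(Q | h) is the vector of success probabilities of the core tests
  % given history h; the weight vector m_t depends only on the test t.
  P(t \mid h) \;=\; m_t^{\top}\, p(Q \mid h).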
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
TLDR
This work analyzes GP-UCB, an intuitive upper-confidence-based algorithm, and bounds its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design and obtaining explicit sublinear regret bounds for many commonly used covariance functions.
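To illustrate the algorithm being analyzed, here is a minimal GP-UCB sketch in Python. It is an illustrative implementation under assumed choices (scikit-learn's GaussianProcessRegressor, an RBF kernel, a finite candidate grid, and a hand-picked constant beta), not the authors' code; the analysis instead ties the confidence width to the maximal information gain.

# Minimal GP-UCB sketch (illustrative, not the authors' implementation).
# Assumes scikit-learn; beta is a hand-picked constant here, whereas the
# analysis chooses it from confidence-interval / information-gain arguments.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def gp_ucb(f, candidates, n_rounds=25, beta=2.0, noise=1e-2, seed=0):
    """Sequentially query f on points from `candidates` using the UCB rule."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_rounds):
        if not X:
            # No data yet: pick a random candidate to start.
            x = candidates[rng.integers(len(candidates))]
        else:
            gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise ** 2)
            gp.fit(np.array(X), np.array(y))
            mu, sigma = gp.predict(candidates, return_std=True)
            # Acquisition: posterior mean plus scaled posterior standard deviation.
            x = candidates[np.argmax(mu + np.sqrt(beta) * sigma)]
        X.append(x)
        y.append(f(x) + noise * rng.standard_normal())
    return X[int(np.argmax(y))]


# Toy usage: maximize sin(6x) over a 1-D grid.
grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
best_x = gp_ucb(lambda x: float(np.sin(6.0 * x[0])), grid)

The selection rule mu + sqrt(beta) * sigma is the upper-confidence index analyzed in the reference; everything around it (grid, kernel, noise level) is placeholder scaffolding.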
Provably Efficient Reinforcement Learning with Aggregated States
TLDR
This work establishes that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret, the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.
Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles
TLDR
This paper considers a setting where an ensemble of pre-trained and possibly inaccurate simulators (models) is available, and the real environment is approximated using a state-dependent linear combination of the ensemble, where the coefficients are determined by the given state features and some unknown parameters.
A unifying framework for computational reinforcement learning theory
TLDR
This thesis is that the KWIK learning model provides a flexible, modularized, and unifying way for creating and analyzing reinforcement-learning algorithms with provably efficient exploration and facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning speed in real-world problems.
High-dimensional statistics: A non-asymptotic viewpoint, volume 48 (2019)
...
...