Corpus ID: 249642134

Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Qi Cai, Zhuoran Yang, Zhaoran Wang
We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, a setting that remains theoretically underexplored. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation, or OP-TENET) that attains an ε-optimal policy within O(1…


Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

This work presents OOM-UCB, a sample-efficient algorithm for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and exploration is essential for learning, thus distinguishing the results from prior work.

Reinforcement Learning of POMDPs using Spectral Methods

This work proposes a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods and proves an order-optimal regret bound with respect to the optimal memoryless policy, with efficient scaling with respect to the dimensionality of the observation and action spaces.

Provably Efficient Reinforcement Learning with Linear Function Approximation

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves a regret bound that is polynomial in d and H and sublinear in T, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.

Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

This work considers off-policy evaluation in a partially observed MDP (POMDP): estimating the value of a given target policy from trajectories with only partial state observations, generated by a different and unknown behavior policy that may depend on the unobserved state.

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

This work proposes a new Bernstein-type concentration inequality for self-normalized martingales in linear bandit problems with bounded noise, along with a new, computationally efficient algorithm with linear function approximation, UCRL-VTR, for the aforementioned linear mixture MDPs in the episodic undiscounted setting.

Provably Efficient Exploration in Policy Optimization

This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a sublinear regret bound.

Planning in Observable POMDPs in Quasipolynomial Time

This work proposes a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs, assuming that well-separated distributions over states lead to well-separated distributions over observations, so that the observations are at least somewhat informative at each step.

RL for Latent MDPs: Regret Guarantees and a Lower Bound

This work considers the regret-minimization problem for reinforcement learning in latent Markov decision processes (LMDPs) and shows that the key is a notion of separation among the system dynamics of the constituent MDPs, providing an efficient algorithm with a local guarantee.

Is Q-learning Provably Efficient?

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler and more flexible to use than model-based approaches.

Bilinear Classes: A Structural Framework for Provable Generalization in RL

This work provides an RL algorithm with polynomial sample complexity for Bilinear Classes, a new structural framework that permits generalization in reinforcement learning across a wide variety of settings through the use of function approximation.