Corpus ID: 219573326

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

  title={Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation},
  author={D. Shah and Dogyoon Song and Zhi Xu and Yuzhe Yang},
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If the $Q$-function is Lipschitz continuous, then the minimal sample complexity for estimating an $\epsilon$-optimal $Q$-function is known to scale as $\Omega(\frac{1}{\epsilon^{d_1+d_2+2}})$ per classical non-parametric learning theory, where $d_1$ and $d_2$ denote the dimensions of the state and action spaces respectively. …
Low-rank State-action Value-function Approximation
This paper proposes different stochastic algorithms to estimate a low-rank factorization of the $Q(s, a)$ matrix, a non-parametric alternative to value-function approximation that dramatically reduces the computational and sample complexities relative to classical Q-learning methods that estimate $Q(s, a)$ separately for each state-action pair.
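As a hedged illustration of the idea (not the exact algorithm of any paper above), a low-rank factorization $Q \approx UV^\top$ can be fit by stochastic gradient steps on individually sampled entries; all names, dimensions, and learning-rate choices below are illustrative:

```python
import numpy as np

# Sketch: approximate a Q(s, a) matrix by a rank-r factorization
# Q ≈ U @ V.T, updating the factors by stochastic gradient steps on
# noisy (s, a, q) samples, as from a generative model.

rng = np.random.default_rng(0)
n_states, n_actions, rank = 50, 20, 3

# Ground-truth low-rank Q matrix, used here only to generate noisy samples.
Q_true = rng.normal(size=(n_states, rank)) @ rng.normal(size=(rank, n_actions))

U = rng.normal(scale=0.1, size=(n_states, rank))
V = rng.normal(scale=0.1, size=(n_actions, rank))
lr = 0.05

for _ in range(20000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    q = Q_true[s, a] + rng.normal(scale=0.1)  # noisy sample of one entry
    err = U[s] @ V[a] - q
    # Simultaneous gradient step on both factors of the sampled entry.
    U[s], V[a] = U[s] - lr * err * V[a], V[a] - lr * err * U[s]

rmse = np.sqrt(np.mean((U @ V.T - Q_true) ** 2))
print(f"RMSE of rank-{rank} estimate: {rmse:.3f}")
```

Note the sample budget here (20,000 draws for 1,000 entries) is deliberately generous; the point is only that the factors, not the full matrix, are estimated.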
Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations
This work considers the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies Π that may not contain any near-optimal policy and provides an algorithm for this setting whose error is bounded in terms of the rank d of the underlying MDP.
PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators
This work proposes a model-based offline RL approach, called PerSim, where a personalized simulator is learned for each agent by collectively using the historical trajectories across all agents prior to learning a policy, and suggests a simple, regularized neural network architecture to effectively learn the transition dynamics per agent, even with scarce, offline data.
Hamiltonian Q-Learning: Leveraging Importance-sampling for Data Efficient RL
Hamiltonian Q-Learning is introduced, a data efficient modification of the Q-learning approach, which adopts an importance-sampling based technique for computing the Q function and exploits the latent low-rank structure of the dynamic system.
Randomized Value Functions via Posterior State-Abstraction Sampling
It is proposed that an agent seeking out latent task structure must explicitly represent and maintain its uncertainty over that structure as part of its overall uncertainty about the environment, and a practical algorithm for doing so is introduced, using two posterior distributions over state abstractions and abstract-state values.


Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
This work proposes a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, exploiting the monotonicity property and intrinsic noise structure of the Bellman operator.
Q-learning with Nearest Neighbors
This work considers model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, and establishes a lower bound showing that the dependence $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary.
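To illustrate the nearest-neighbor ingredient in a minimal, hypothetical form (the data, target function, and parameter choices below are stand-ins, not the paper's construction), one can estimate a value at a query state by averaging the values of its $k$ nearest sampled states:

```python
import numpy as np

# Sketch: nearest-neighbor regression of the kind used in
# nearest-neighbor Q-learning over a continuous state space.

rng = np.random.default_rng(1)
states = rng.uniform(0, 1, size=(500, 2))           # sampled 2-D states
q_values = np.sin(3 * states[:, 0]) + states[:, 1]  # stand-in "observed" values

def knn_q(query, k=10):
    """Average the values of the k nearest sampled states to the query."""
    dists = np.linalg.norm(states - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return q_values[nearest].mean()

q_hat = knn_q(np.array([0.5, 0.5]))
q_true = np.sin(1.5) + 0.5
print(f"estimate {q_hat:.3f} vs truth {q_true:.3f}")
```

The $1/\varepsilon^{d+2}$ dependence reflects exactly this local-averaging structure: covering a $d$-dimensional state space at resolution $\varepsilon$ requires exponentially many neighborhoods in $d$.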
Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound
These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.
Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model
The method is extended to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors.
Nuclear norm penalization and optimal rates for noisy low rank matrix completion
This paper deals with the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We propose a new …
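For context, the nuclear-norm penalized least-squares estimator studied in this line of work takes the standard form below (notation as in the snippet above; $\lambda > 0$ is a regularization parameter):

```latex
\hat{A} \;=\; \operatorname*{arg\,min}_{A \in \mathbb{R}^{m_1 \times m_2}}
\;\frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \langle X_i, A \rangle \bigr)^2
\;+\; \lambda \, \| A \|_{*},
```

where $\langle X_i, A \rangle = \operatorname{tr}(X_i^\top A)$, the $X_i$ are the measurement matrices, and $\|A\|_{*}$ is the nuclear (trace) norm, i.e. the sum of the singular values of $A$, which acts as a convex surrogate for rank.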
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity …
Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes
This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and it is hoped that the analysis may be useful more broadly.
A Theoretical Analysis of Deep Q-Learning
This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for zero-sum Markov games with two players.
On the sample complexity of reinforcement learning.
Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.