# Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

@article{Shah2020SampleER, title={Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation}, author={D. Shah and Dogyoon Song and Zhi Xu and Yuzhe Yang}, journal={ArXiv}, year={2020}, volume={abs/2006.06135} }

We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If the $Q$-function is Lipschitz continuous, then the minimal sample complexity for estimating an $\epsilon$-optimal $Q$-function is known to scale as ${\Omega}(\frac{1}{\epsilon^{d_1+d_2+2}})$ per classical non-parametric learning theory, where $d_1$ and $d_2$ denote the dimensions of the state and action spaces, respectively. The…
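The core idea of the paper is that when the $Q$-function, viewed as a matrix over (discretized) states and actions, has low rank, it can be recovered from far fewer samples than the non-parametric bound suggests. A minimal sketch of that intuition (not the paper's exact algorithm; sizes, observation probability, and the truncated-SVD estimator are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, r = 100, 80, 2  # discretized states, actions, and assumed rank

# Ground-truth low-rank Q matrix: Q = U @ V.T with rank r.
U = rng.normal(size=(n_s, r))
V = rng.normal(size=(n_a, r))
Q_true = U @ V.T

# Generative model: observe a random subset of entries, with small noise.
p = 0.5  # probability each (state, action) entry is sampled
mask = rng.random((n_s, n_a)) < p
Q_obs = np.where(mask, Q_true + 0.01 * rng.normal(size=Q_true.shape), 0.0)

# Rescaling by 1/p makes the zero-filled matrix an unbiased estimate of
# Q_true; projecting onto the top-r singular subspace removes most of the
# noise introduced by the missing entries.
u, s, vt = np.linalg.svd(Q_obs / p, full_matrices=False)
Q_hat = u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]

err = np.linalg.norm(Q_hat - Q_true) / np.linalg.norm(Q_true)
```

Even with half the entries unobserved, the rank-$r$ projection recovers the full matrix to small relative error, which is the sense in which low-rank structure buys sample efficiency.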

#### 5 Citations

Low-rank State-action Value-function Approximation

- Computer Science
- ArXiv
- 2021

This paper proposes several stochastic algorithms to estimate a low-rank factorization of the Q(s, a) matrix, a non-parametric alternative to value-function approximation that dramatically reduces the computational and sample complexities relative to classical Q-learning methods, which estimate Q(s, a) separately for each state-action pair.
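The factorization idea described above can be sketched with a simple stochastic-gradient update that touches only the sampled row and column on each observation (a generic matrix-factorization SGD, offered as an illustration; the sizes, learning rate, and iteration count are assumptions, not the cited paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, r, lr = 30, 20, 3, 0.05

# Ground-truth rank-r Q matrix to be recovered from sampled entries.
Q_true = rng.normal(size=(n_s, r)) @ rng.normal(size=(r, n_a))

# Factors: Q(s, a) is approximated by the inner product L[s] @ R[a].
L = 0.1 * rng.normal(size=(n_s, r))
R = 0.1 * rng.normal(size=(n_a, r))

for _ in range(100_000):
    s, a = rng.integers(n_s), rng.integers(n_a)
    err = L[s] @ R[a] - Q_true[s, a]  # residual on one sampled pair
    L_s = L[s].copy()                 # snapshot before the joint update
    L[s] -= lr * err * R[a]           # gradient step on the state factor
    R[a] -= lr * err * L_s            # gradient step on the action factor

rel_err = np.linalg.norm(L @ R.T - Q_true) / np.linalg.norm(Q_true)
```

Each update costs O(r) instead of touching the full table, which is where the computational savings over per-entry Q-learning come from.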

Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

- Computer Science, Engineering
- ArXiv
- 2021

This work considers the more realistic setting of agnostic RL with rich observation spaces and a fixed policy class Π that may not contain any near-optimal policy, and provides an algorithm for this setting whose error is bounded in terms of the rank d of the underlying MDP.

PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators

- Computer Science
- ArXiv
- 2021

This work proposes a model-based offline RL approach, PerSim, in which a personalized simulator is learned for each agent by collectively using the historical trajectories across all agents prior to learning a policy; it suggests a simple, regularized neural network architecture to effectively learn the per-agent transition dynamics, even from scarce offline data.

Hamiltonian Q-Learning: Leveraging Importance-sampling for Data Efficient RL

- Computer Science, Engineering
- ArXiv
- 2020

Hamiltonian Q-Learning is introduced, a data-efficient modification of Q-learning that adopts an importance-sampling-based technique for computing the Q function and exploits the latent low-rank structure of the dynamical system.

Randomized Value Functions via Posterior State-Abstraction Sampling

- Computer Science, Mathematics
- ArXiv
- 2020

It is proposed that an agent seeking out latent task structure must explicitly represent and maintain its uncertainty over that structure as part of its overall uncertainty about the environment; a practical algorithm is introduced that does so using two posterior distributions, over state abstractions and abstract-state values.

#### References

Showing 1–10 of 57 references.

Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

- Computer Science
- ICML
- 2019

This work proposes a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, exploiting the monotonicity property and intrinsic noise structure of the Bellman operator.

Q-learning with Nearest Neighbors

- Computer Science, Mathematics
- NeurIPS
- 2018

This work considers model-free reinforcement learning for infinite-horizon discounted Markov decision processes (MDPs) with a continuous state space and unknown transition kernel, and establishes a lower bound arguing that the dependence $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary.

Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

- Computer Science, Mathematics
- ICML
- 2020

These results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$.

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

- Computer Science, Mathematics
- NeurIPS
- 2018

The method is extended to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model, and matches the sample-complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors.

Nuclear norm penalization and optimal rates for noisy low rank matrix completion

- Mathematics
- 2010

This paper deals with the trace regression model, where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$, corrupted by noise, are observed. We propose a new…

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

- Computer Science, Mathematics
- Machine Learning
- 2013

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity…

Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes

- Mathematics, Computer Science
- SODA
- 2018

This paper is one of few instances of using sampling to obtain a linearly convergent linear programming algorithm, and it is hoped that the analysis may be useful more broadly.

A Theoretical Analysis of Deep Q-Learning

- Computer Science, Mathematics
- L4DC
- 2020

This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for two-player zero-sum Markov games.

On the sample complexity of reinforcement learning.

- Computer Science
- 2003

Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but with only a polynomial dependence on the horizon time.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

- Computer Science, Mathematics
- ICML
- 2018

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.