# Is Q-learning Provably Efficient?

@inproceedings{Jin2018IsQP, title={Is Q-learning Provably Efficient?}, author={Chi Jin and Zeyuan Allen-Zhu and S{\'e}bastien Bubeck and Michael I. Jordan}, booktitle={NeurIPS}, year={2018} }

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler and more flexible, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free…
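The abstract's distinction can be made concrete with a minimal sketch of the tabular Q-learning update, the model-free rule the paper builds on (shown here in its plain form, without the UCB exploration bonus the paper analyzes; all names and parameter values are illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One model-free Q-learning step: nudge Q(s, a) toward the bootstrapped
    target r + gamma * max_b Q(s', b), never estimating the transition kernel."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Toy example: two states, two actions, all values zero except Q[(1, 0)] = 1.
Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 0)])  # 0.5 * (1.0 + 0.9 * 1.0) = 0.95
```

Note that the update touches only the visited state-action pair, which is exactly why sample efficiency (how many such visits are needed) is the central question here.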


## 366 Citations

Stochastic Lipschitz Q-Learning

- Mathematics, Computer Science · ArXiv
- 2019

This work proposes a novel algorithm for MDPs in a more general setting, with infinitely many states and actions, under the assumption that the payoff function and transition kernel are Lipschitz continuous, and provides a corresponding theoretical justification for the algorithm.

Efficient Model-free Reinforcement Learning in Metric Spaces

- Computer Science, Mathematics · ArXiv
- 2019

This work presents an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space that does not require access to a black-box planning oracle.

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

- Computer Science, Mathematics · ArXiv
- 2020

This paper derives an algorithm that requires $\tilde{O}(BD + D^{3/2} S^2 A)$ time steps to collect the desired $b(s,a)$ samples in any unknown communicating MDP with $S$ states, $A$ actions, and diameter $D$.

Provably More Efficient Q-Learning in the Full-Feedback/One-Sided-Feedback Settings

- Computer Science, Mathematics · ArXiv
- 2020

Numerical experiments using the classical inventory control problem as an example demonstrate the superior efficiency of FQL and HQL, and show the potential of tailoring reinforcement learning algorithms to richer feedback models, which are prevalent in many natural problems.

Efficient Exploration for Model-based Reinforcement Learning with Continuous States and Actions

- Computer Science · ArXiv
- 2020

This work improves the regret bound and presents a model-based posterior sampling algorithm with model predictive control for action selection, which achieves the best sample efficiency on benchmark control tasks among prior model-based algorithms and matches the asymptotic performance of model-free algorithms.

On Optimism in Model-Based Reinforcement Learning

- Computer Science · ArXiv
- 2020

This paper introduces a tractable approach to optimism via noise-augmented Markov decision processes (MDPs), which is shown to obtain a competitive regret bound when the augmentation uses Gaussian noise.

Model-Free Approach to Evaluate Reinforcement Learning Algorithms

- 2021

The key objective of Reinforcement Learning (RL) is to learn an optimal agent’s behaviour in an unknown environment. A natural performance metric is given by the value function V π which is the…

Can Q-Learning be Improved with Advice?

- Computer Science, Mathematics · ArXiv
- 2021

This paper addresses the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal Q-value function, is given to the algorithm.

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

- Computer Science, Mathematics · ICML
- 2021

A model-free reinforcement learning algorithm, inspired by the popular randomized least-squares value iteration (RLSVI) algorithm as well as the optimism principle, that drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.

Task-agnostic Exploration in Reinforcement Learning

- Computer Science, Mathematics · NeurIPS
- 2020

An efficient task-agnostic RL algorithm, UCBZero, that finds near-optimal policies for $N$ arbitrary tasks after at most $\tilde O(\log(N)H^5SA/\epsilon^2)$ exploration episodes, together with an $N$-independent sample complexity bound for UCBZero in the statistically easier setting where the ground-truth reward functions are known.

## References

Showing 1–10 of 29 references

Temporal Difference Models: Model-Free Deep RL for Model-Based Control

- Computer Science, Mathematics · ICLR
- 2018

Model-free reinforcement learning (RL) is a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world…

On the sample complexity of reinforcement learning.

- Computer Science
- 2003

Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.

Variance Reduction Methods for Sublinear Reinforcement Learning

- Computer Science, Mathematics · ArXiv
- 2018

This work considers the problem of provably optimal reinforcement learning for (episodic) finite horizon MDPs, i.e. how an agent learns to maximize his/her (long term) reward in an uncertain…

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

- Computer Science, Mathematics · Machine Learning
- 2013

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample-complexity…

Speedy Q-Learning

- Computer Science · NIPS
- 2011

We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound…

PAC model-free reinforcement learning

- Computer Science · ICML
- 2006

This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.

Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning

- Computer Science, Mathematics · 2018 IEEE International Conference on Robotics and Automation (ICRA)
- 2018

It is demonstrated that neural network dynamics models can in fact be combined with model predictive control (MPC) to achieve excellent sample complexity in a model-based reinforcement learning algorithm, producing stable and plausible gaits that accomplish various complex locomotion tasks.

Complexity Analysis of Real-Time Reinforcement Learning

- Computer Science · AAAI
- 1993

This paper analyzes the complexity of online reinforcement learning algorithms, namely asynchronous real-time versions of Q-learning and value iteration, applied to the problem of reaching a goal state in deterministic domains, and shows that the algorithms are tractable with only a simple change in the task representation or initialization.

Near-optimal Regret Bounds for Reinforcement Learning

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2008

This work presents a reinforcement learning algorithm with total regret $O(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$, and proposes a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy that moves from $s$ to $s'$ in at most $D$ steps in expectation.

Generalization and Exploration via Randomized Value Functions

- Mathematics, Computer Science · ICML
- 2016

The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.