Corpus ID: 49672097

Is Q-learning Provably Efficient?

@inproceedings{Jin2018IsQP,
  title={Is Q-learning Provably Efficient?},
  author={Chi Jin and Zeyuan Allen-Zhu and S{\'e}bastien Bubeck and Michael I. Jordan},
  booktitle={NeurIPS},
  year={2018}
}
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free…
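For concreteness, here is a minimal sketch of tabular, episodic Q-learning with an optimistic (UCB-style) exploration bonus, in the spirit of the algorithm analyzed in this paper. The environment interface (reset/step), the bonus constant c, the failure probability p, and the exact learning-rate and bonus constants are illustrative assumptions, not the paper's precise specification.

import numpy as np

def ucb_q_learning(env, S, A, H, K, c=1.0, p=0.05):
    # Sketch of tabular episodic Q-learning with a Hoeffding-style UCB bonus.
    Q = np.full((H, S, A), float(H))       # optimistic initialization: Q starts at H
    N = np.zeros((H, S, A), dtype=int)     # visit counts per (step, state, action)
    iota = np.log(S * A * H * K / p)       # log factor used in the exploration bonus
    for _ in range(K):                     # K episodes of horizon H
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))    # act greedily w.r.t. the optimistic Q
            s_next, r, done = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)      # step size alpha_t = (H+1)/(H+t)
            bonus = c * np.sqrt(H**3 * iota / t)
            v_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s_next
            if done:
                break
    return Q

Here env stands for any environment exposing reset() -> state index and step(action) -> (next state, reward, done); the returned optimistic Q-table induces a greedy policy for each step h.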
Stochastic Lipschitz Q-Learning
TLDR
This work proposes a novel algorithm for MDPs in a more general setting, with infinitely many states and actions, under the assumption that the payoff function and transition kernel are Lipschitz continuous, and provides corresponding theoretical justification for the algorithm.
Efficient Model-free Reinforcement Learning in Metric Spaces
TLDR
This work presents an efficient model-free Q-learning-based algorithm for MDPs with a natural metric on the state-action space that does not require access to a black-box planning oracle.
A Provably Efficient Sample Collection Strategy for Reinforcement Learning
TLDR
This paper derives an algorithm that requires $\tilde{O}(BD + D^{3/2} S^2 A)$ time steps to collect $B = \sum_{s,a} b(s,a)$ desired samples in any unknown, communicating MDP with $S$ states, $A$ actions, and diameter $D$.
Provably More Efficient Q-Learning in the Full-Feedback/One-Sided-Feedback Settings
TLDR
Numerical experiments using the classical inventory control problem as an example demonstrate the superior efficiency of FQL and HQL, and show the potential of tailoring reinforcement learning algorithms for richer feedback models, which are prevalent in many natural problems.
Efficient Exploration for Model-based Reinforcement Learning with Continuous States and Actions
TLDR
The regret bound is improved, and a model-based posterior sampling algorithm with model predictive control for action selection is presented, which achieves the best sample efficiency in benchmark control tasks compared to prior model-based algorithms and matches the asymptotic performance of model-free algorithms.
On Optimism in Model-Based Reinforcement Learning
TLDR
This paper introduces a tractable approach to optimism via noise-augmented Markov Decision Processes (MDPs), which is shown to obtain a competitive regret bound when augmenting with Gaussian noise.
Model-Free Approach to Evaluate Reinforcement Learning Algorithms
The key objective of Reinforcement Learning (RL) is to learn an optimal agent's behaviour in an unknown environment. A natural performance metric is given by the value function $V^\pi$, which is the…
Can Q-Learning be Improved with Advice?
TLDR
This paper addresses the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal Q-value function, is given to the algorithm.
Randomized Exploration for Reinforcement Learning with General Value Function Approximation
TLDR
A model-free reinforcement learning algorithm is proposed, inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle, which drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
Task-agnostic Exploration in Reinforcement Learning
TLDR
This work presents an efficient task-agnostic RL algorithm, UCBZero, that finds near-optimal policies for $N$ arbitrary tasks after at most $\tilde{O}(\log(N)H^5SA/\epsilon^2)$ exploration episodes, and provides an $N$-independent sample complexity bound for UCBZero in the statistically easier setting where the ground-truth reward functions are known.

References

SHOWING 1-10 OF 29 REFERENCES
Temporal Difference Models: Model-Free Deep RL for Model-Based Control
Model-free reinforcement learning (RL) is a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world…
On the sample complexity of reinforcement learning.
TLDR
Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.
Variance Reduction Methods for Sublinear Reinforcement Learning
This work considers the problem of provably optimal reinforcement learning for (episodic) finite-horizon MDPs, i.e. how an agent learns to maximize his/her (long-term) reward in an uncertain…
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity…
Speedy Q-Learning
We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound…
PAC model-free reinforcement learning
TLDR
This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience, and that Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning
TLDR
It is demonstrated that neural network dynamics models can in fact be combined with model predictive control (MPC) to achieve excellent sample complexity in a model-based reinforcement learning algorithm, producing stable and plausible gaits that accomplish various complex locomotion tasks.
Complexity Analysis of Real-Time Reinforcement Learning
TLDR
This paper analyzes the complexity of online reinforcement learning algorithms, namely asynchronous real-time versions of Q-learning and value iteration, applied to the problem of reaching a goal state in deterministic domains, and shows that the algorithms are tractable with only a simple change in the task representation or initialization.
Near-optimal Regret Bounds for Reinforcement Learning
TLDR
This work presents a reinforcement learning algorithm with total regret $O(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$, and proposes a new parameter: an MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps.
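One way to write that diameter definition as a formula, where $T(s' \mid \pi, s)$ denotes the expected number of steps needed to reach $s'$ from $s$ when following policy $\pi$:

$D := \max_{s \neq s'} \min_{\pi} \mathbb{E}\big[T(s' \mid \pi, s)\big]$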
Generalization and Exploration via Randomized Value Functions
TLDR
The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.