• Corpus ID: 59669372

On the sample complexity of reinforcement learning.

  Sham M. Kakade. "On the Sample Complexity of Reinforcement Learning." PhD thesis.
This thesis is a detailed investigation into the following question: how much data must an agent collect in order to perform "reinforcement learning" successfully? Key Method: We build on the sample-based algorithms suggested by Kearns, Mansour, and Ng [2000]. Their sample complexity bounds have no dependence on the size of the state space, an exponential dependence on the planning horizon time, and a linear dependence on the complexity of the policy class. We suggest novel algorithms with more restricted guarantees whose…
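The flavor of the Kearns, Mansour, and Ng sample-based planners referenced above can be illustrated with a sparse-sampling sketch: estimate Q-values by drawing a few successor states per action from a generative model and recursing to a fixed depth, which costs exponentially in the horizon but is independent of the state-space size. This is an illustrative sketch, not code from the thesis; `sparse_sample_q` and `toy_model` are hypothetical names.

```python
def sparse_sample_q(generative_model, state, actions, depth, c, gamma=0.9):
    """Estimate Q-values by recursion: draw c successor states per action
    at each level, to a fixed depth.  Total cost is (c * |A|)^depth --
    exponential in the horizon, independent of the state-space size."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(c):
            # generative_model may be stochastic; here it is a toy model
            next_state, reward = generative_model(state, a)
            v_next = max(sparse_sample_q(generative_model, next_state,
                                         actions, depth - 1, c, gamma).values())
            total += reward + gamma * v_next
        q[a] = total / c
    return q

# Toy deterministic chain: action 1 moves right; reward 1 once state >= 3.
def toy_model(state, action):
    next_state = state + action
    return next_state, (1.0 if next_state >= 3 else 0.0)

q = sparse_sample_q(toy_model, 0, actions=[0, 1], depth=3, c=2)
best = max(q, key=q.get)  # moving right should look best
```

Note that the recursion never enumerates the state space: it only touches states reachable by sampling, which is the source of the state-space-independent bound.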


Reinforcement Learning via Online Linear Regression
This work shows that, given an admissible KWIK online linear regression algorithm and the assumption that the target function f is "almost linear" in x (as opposed to exactly linear, as in the original setting [7]), an efficient model-free RL algorithm with linear value functions can be constructed for general Markov decision processes.
Probably Approximately Correct (PAC) Exploration in Reinforcement Learning
A theorem is presented that provides sufficient conditions for an algorithm to be PAC-MDP, or Probably Approximately Correct (PAC) in RL, and it is shown how these conditions can be applied to prove that efficient learning is possible in three interesting scenarios: finite MDPs (i.e., the "tabular" case), factored MDPs, and continuous MDPs with linear dynamics.
Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning
This work develops an algorithm that achieves the same PAC guarantee while using only O(1) episodes of environment interactions, completely settling the horizon-dependence of the sample complexity in RL.
Studying Optimality Bounds for Reinforcement Learning from a KWIK Perspective
  • Computer Science
  • 2016
Another framework for studying the RL learning task, known as Knows What It Knows (KWIK), is explored, which can provide theoretical bounds on maintaining the balance between exploration and exploitation.
Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning
An algorithm is introduced that iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model, and an information-theoretic, Bayesian regret bound is proved for this algorithm that holds for any finite-horizon, episodic sequential decision-making problem.
Reinforcement Learning and Monte-Carlo Methods
This course will largely focus on sample efficiency, that is, the problem of achieving a learning goal using as little data as possible, around which many of its discussions center.
Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
It is proved that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon, and when the values are appropriately normalized, this results shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense.
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment.
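The model-free idea in the summary above can be sketched in a few lines of tabular Q-learning: the agent updates Q-values directly from sampled transitions and never builds a transition model. This is a minimal illustrative sketch on a toy chain environment; `q_learning` and `chain_step` are hypothetical names, not code from the cited paper.

```python
import random

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Minimal tabular Q-learning: update Q directly from sampled
    transitions, with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: q[s][x])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q

# Toy 4-state chain: action 1 moves right; reaching state 3 pays 1 and ends.
def chain_step(s, a):
    s2 = min(s + a, 3)
    done = s2 == 3
    return s2, (1.0 if done else 0.0), done

q = q_learning(n_states=4, n_actions=2, step=chain_step)
```

After training, the greedy policy at each state prefers moving right, and the learned values approximate the discounted returns (roughly gamma raised to the distance from the goal).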
Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning
This work explores a restricted class of MDPs to obtain guarantees for offline reinforcement learning, discusses algorithms that exploit the AIR property, and provides a theoretical analysis for an algorithm based on Fitted Q-Iteration.
Model-based reinforcement learning with nearly tight exploration complexity bounds
Mormax, a modified version of the Rmax algorithm, is shown to need at most O(N log N) exploratory steps, matching the lower bound up to logarithmic factors as well as the upper bound of the state-of-the-art model-free algorithm, while the new bound improves the dependence on other problem parameters.


Efficient reinforcement learning
A new formal model for studying reinforcement learning, based on Valiant's PAC framework, is proposed that requires the learner to produce a policy whose expected value from the initial state is ε-close to that of the optimal policy, with probability no less than 1−δ.
Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms
It is shown that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed, and that the amount of memory required by the model-based approach is closer to N than to N².
Infinite-Horizon Policy-Gradient Estimation
GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
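The GPOMDP estimator summarized above can be sketched as follows: maintain a discounted eligibility trace of the score function grad log π and correlate it with observed rewards, where the discount β < 1 trades bias for variance. This is an illustrative single-parameter sketch under a logistic policy, not the paper's implementation; `gpomdp_gradient` and `bandit` are hypothetical names.

```python
import math
import random

def gpomdp_gradient(theta, step_env, horizon=2000, beta=0.9, seed=0):
    """GPOMDP-style estimate of the average-reward gradient: keep a
    discounted eligibility trace z of grad log pi and average r_t * z_t."""
    rng = random.Random(seed)
    z = 0.0      # eligibility trace (scalar: one policy parameter)
    grad = 0.0   # running average of r_t * z_t
    s = 0
    for t in range(1, horizon + 1):
        p1 = 1.0 / (1.0 + math.exp(-theta))   # P(action 1), logistic policy
        a = 1 if rng.random() < p1 else 0
        z = beta * z + (a - p1)               # grad_theta log pi(a) = a - p1
        s, r = step_env(s, a)
        grad += (r * z - grad) / t            # incremental mean
    return grad

# Toy environment: action 1 always pays reward 1, action 0 pays 0.
def bandit(s, a):
    return s, float(a)

g = gpomdp_gradient(theta=0.0, step_env=bandit)
# g should be positive: raising theta raises P(action 1) and hence reward
```

The bias the summary mentions comes from β: with β < 1 the trace forgets distant past actions, which shrinks variance at the cost of a biased estimate of the true average-reward gradient.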
Learning to Solve Markovian Decision Processes
This dissertation establishes a novel connection between stochastic approximation theory and RL that provides a uniform framework for understanding all the different RL algorithms that have been proposed to date and highlights a dimension that clearly separates all RL research from prior work on DP.
Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems
This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems, and finds that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured.
Complexity Analysis of Real-Time Reinforcement Learning
This paper analyzes the complexity of on-line reinforcement learning algorithms, namely asynchronous real-time versions of Q-learning and value iteration, applied to the problem of reaching a goal state in deterministic domains, and shows that the algorithms are tractable with only a simple change in the task representation or initialization.
Approximate Planning in Large POMDPs via Reusable Trajectories
Upper bounds on the sample complexity are proved showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π, and exponentially on the horizon time.
Exploration in Gradient-Based Reinforcement Learning
This paper provides a method for using importance sampling to allow any well-behaved directed exploration policy during learning, and shows both theoretically and experimentally that this method can achieve dramatic performance improvements.
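The importance-sampling idea in the summary above can be sketched in the simplest setting: act under an exploratory behavior policy, then reweight each observed reward by the likelihood ratio target(a) / behavior(a) to obtain an unbiased estimate of the target policy's value. This is an illustrative bandit-style sketch, not the paper's method; `off_policy_value` is a hypothetical name.

```python
import random

def off_policy_value(target, behavior, reward, n=10000, seed=0):
    """Estimate the target policy's expected reward while sampling actions
    from a different (exploratory) behavior policy, by reweighting each
    sample with the likelihood ratio target[a] / behavior[a]."""
    rng = random.Random(seed)
    actions = list(range(len(behavior)))
    total = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=behavior)[0]
        total += (target[a] / behavior[a]) * reward(a)
    return total / n

# Target policy nearly always picks action 1; behavior explores uniformly.
target = [0.1, 0.9]
behavior = [0.5, 0.5]
est = off_policy_value(target, behavior, reward=lambda a: float(a))
# True value under the target policy is 0.9
```

The estimate is unbiased for any behavior policy that assigns positive probability wherever the target does, which is what lets a "well-behaved" directed exploration policy be decoupled from the policy being evaluated.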