Corpus ID: 239016143

Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs

Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj M. Nagaraj, Praneeth Netrapalli
Q-learning is a popular Reinforcement Learning (RL) algorithm that is widely used in practice with function approximation (Mnih et al., 2015). In contrast, existing theoretical results about Q-learning are pessimistic. For example, Baird (1995) shows that Q-learning does not converge even with linear function approximation for linear MDPs. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have sub-optimal sample complexity (Li et al., 2021; Azar et al., 2013)…
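For context, the Q-learning update discussed in the abstract can be sketched in its simplest tabular form. This is a minimal illustrative example; the toy two-state MDP, step size, and uniform exploration policy below are assumptions for the sketch, not the paper's setting or algorithm:

```python
import numpy as np

# Minimal tabular Q-learning sketch on a toy 2-state, 2-action MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 2, 2, 0.9, 0.1
# Deterministic toy dynamics: P[s, a] -> next state, R[s, a] -> reward.
P = np.array([[0, 1], [1, 0]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(5000):
    a = rng.integers(n_actions)       # behavior policy: uniform exploration
    s_next, r = P[s, a], R[s, a]
    # Q-learning update toward the bootstrapped target r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q.round(2))
```

On this toy MDP the iterates settle near the fixed point Q* = [[10, 9], [9, 10]]; the theoretical subtleties the paper addresses arise once the table is replaced by function approximation and samples are Markovian.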


Streaming Linear System Identification with Reverse Experience Replay
This work provides the first (to the best of the authors' knowledge) optimal SGD-style algorithm for the classical problem of linear system identification, also known as VAR model estimation, and demonstrates that knowledge of the dependency structure can aid in designing algorithms that deconstruct the dependencies between samples optimally in an online fashion.
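The reverse-replay idea behind this line of work can be sketched in plain SGD terms. This is an illustrative sketch only; the system dimensions, buffer size, and step size below are assumptions, not the cited paper's algorithm or tuned values:

```python
import numpy as np

# Sketch of SGD with reverse experience replay (RER) for estimating A in the
# linear system x_{t+1} = A @ x_t + noise.
rng = np.random.default_rng(1)
d, T, B, lr = 3, 20000, 10, 0.01
A_true = 0.5 * np.eye(d)                 # stable ground-truth dynamics

# Generate one streaming trajectory of the linear system.
xs = np.zeros((T + 1, d))
for t in range(T):
    xs[t + 1] = A_true @ xs[t] + rng.standard_normal(d)

# SGD over consecutive size-B buffers, replaying each buffer in REVERSE order,
# which counteracts the bias introduced by the Markovian sampling order.
A_hat = np.zeros((d, d))
for start in range(0, T, B):
    for t in reversed(range(start, min(start + B, T))):   # key step: reverse replay
        x, x_next = xs[t], xs[t + 1]
        grad = np.outer(A_hat @ x - x_next, x)  # grad of 0.5 * ||A x - x'||^2
        A_hat -= lr * grad

print(np.linalg.norm(A_hat - A_true))    # small estimation error
```

With a forward pass the correlation between consecutive samples biases the updates; processing each buffer back-to-front is the mechanism the RER line of work analyzes.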


Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms
It is shown that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed, and that the amount of memory required by the model-based approach is closer to N than to N².

A Theoretical Analysis of Deep Q-Learning
This work makes the first attempt to theoretically understand the deep Q-network (DQN) algorithm from both algorithmic and statistical perspectives, and proposes the Minimax-DQN algorithm for two-player zero-sum Markov games.
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs), and prove new PAC bounds on the sample complexity…

Provably Efficient Reinforcement Learning with Linear Function Approximation
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.

A new convergent variant of Q-learning with linear function approximation
This work identifies a novel set of conditions that ensure convergence with probability 1 of Q-learning with linear function approximation by proposing a two-time-scale variation thereof, and establishes the convergence of the resulting algorithm.
A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation
This paper proves that neural Q-learning finds the optimal policy with an $O(1/\sqrt{T})$ convergence rate if the neural function approximator is sufficiently overparameterized, where $T$ is the number of iterations.

Momentum Q-learning with Finite-Sample Convergence Guarantee
This paper proposes the MomentumQ algorithm, which integrates Nesterov's and Polyak's momentum schemes and generalizes existing momentum-based Q-learning algorithms, and establishes a convergence guarantee for MomentumQ with linear function approximation and Markovian sampling.
Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal
This work builds the maximum likelihood estimate of the transition model in the MDP from observations and then finds an optimal policy in this empirical MDP; the approach simplifies algorithm design because it does not tie the algorithm to the sampling procedure.

Self-improving reactive agents based on reinforcement learning, planning and teaching
This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning; the three extensions are experience replay, learning action models for planning, and teaching.

Toward Off-Policy Learning Control with Function Approximation
The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.