Q-learning (Christopher J. C. H. Watkins and Peter Dayan, Machine Learning)
Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum…
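The incremental update that the abstract describes can be sketched as a minimal tabular Q-learning loop. Everything below (the 5-state chain environment, the hyperparameters, the epsilon-greedy exploration) is an illustrative assumption, not taken from the paper:

```python
import random

# Toy 5-state chain MDP (illustrative assumption): action 0 moves left,
# action 1 moves right; reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # assumed hyperparameters

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    random.seed(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda a_: q[(s, a_)])
            s2, r, done = step(s, a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + GAMMA * max(q[(s2, a_)] for a_ in ACTIONS)
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            s = s2
    return q

q = train()
# With enough episodes the greedy policy moves right in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

The convergence theorem concerns exactly this estimator: under sufficient visitation of every state-action pair and appropriately decaying step sizes, the table `q` approaches the optimal action-values.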
Labeled Initialized Adaptive Play Q-learning for Stochastic Games
The Labeled IAPQ converges faster than IAPQ by permitting a predefined tolerance of learning error, and it establishes an effective stopping criterion that terminates the learning process at a near-optimal point with a flexible learning speed/quality tradeoff.
Dynamic Choice of State Abstraction in Q-Learning
The approach significantly outperforms Q-learning during the learning process without penalizing long-term performance, and is coupled with an ad hoc exploration strategy that collects key information, allowing the algorithm to enrich state descriptions earlier.
Planning and Learning with Stochastic Action Sets
This work formalizes and investigates MDPs with stochastic action sets (SAS-MDPs) to provide these foundations for reinforcement learning, and shows that optimal policies and value functions in this model have a structure that admits a compact representation.
PAC model-free reinforcement learning
This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
On-policy concurrent reinforcement learning
It is proven that these hybrid techniques are guaranteed to converge to their desired fixed points under some restrictions, and it is shown, experimentally, that the new techniques can learn better policies than the previous algorithms during some phases of the exploration.
Internally Driven Q-learning - Convergence and Generalization Results
Internally Driven Q-learning is more psychologically plausible than Q-learning, and it devolves control and thus autonomy to agents that are otherwise at the mercy of the environment (i.e., of the designer).
Sparse cooperative Q-learning
This paper examines a compact representation of the joint state-action space of a group of cooperative agents, and uses a coordination-graph approach in which the Q-values are represented by value rules that specify the coordination dependencies of the agents at particular states.
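The value-rule representation mentioned above can be sketched as follows. The rule set, agent names, and two-agent doorway scenario are invented for illustration, and the paper's coordination-graph variable elimination is replaced here by brute-force enumeration over the small joint action space:

```python
from itertools import product

# Each value rule pairs a context (a state plus a possibly partial joint
# action) with a payoff; the joint Q-value is the sum of matching rules.
# Hypothetical rule set for two agents meeting at a doorway.
rules = [
    {"state": "doorway", "actions": {"a1": "wait", "a2": "pass"}, "value": 5.0},
    {"state": "doorway", "actions": {"a1": "pass", "a2": "pass"}, "value": -10.0},
    {"state": "open", "actions": {"a1": "pass"}, "value": 2.0},
    {"state": "open", "actions": {"a2": "pass"}, "value": 2.0},
]

def joint_q(state, joint_action):
    """Sum the payoffs of all value rules whose context matches."""
    return sum(r["value"] for r in rules
               if r["state"] == state
               and all(joint_action.get(ag) == act
                       for ag, act in r["actions"].items()))

def best_joint_action(state, choices=("pass", "wait")):
    # Brute-force maximization; a coordination graph would exploit the
    # rules' sparse dependency structure instead.
    return max((dict(zip(("a1", "a2"), acts))
                for acts in product(choices, repeat=2)),
               key=lambda ja: joint_q(state, ja))
```

In the "doorway" state the rules encode a coordination dependency between the two agents (one should wait while the other passes), whereas in the "open" state each agent's contribution is independent, which is the sparsity the representation exploits.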
Reinforcement learning for factored Markov decision processes
This thesis presents novel algorithms for learning the dynamics, learning the value function, and selecting good actions in Markov decision processes; simulation results show that these new methods can be used to solve very large problems.
Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis
Q-learning's sample complexity matches that of speedy Q-learning without requiring extra computation and storage, albeit still being considerably higher than the minimax lower bound for problems with long effective horizon.
Real-valued Q-learning in multi-agent cooperation
A Stochastic Recording Real-Valued unit is introduced to Q-learning to differentiate the actions corresponding to different state inputs but categorized to the same state.
Self-improving reactive agents based on reinforcement learning, planning and teaching
This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and each combined with three extensions for speeding up learning: experience replay, learning action models for planning, and teaching.
Learning control of finite Markov chains with an explicit trade-off between estimation and control
It is proven that, by suitable choice of control parameter values, this scheme becomes epsilon-optimal as well as optimal, in the sense that the relative frequency of making optimal decisions tends to its maximum.