Corpus ID: 219955936

Provably adaptive reinforcement learning in metric spaces

Tongyi Cao, Akshay Krishnamurthy
We study reinforcement learning in continuous state and action spaces endowed with a metric. We provide a refined analysis of the algorithm of Sinclair, Banerjee, and Yu (2019) and show that its regret scales with the zooming dimension of the instance. This parameter, which originates in the bandit literature, captures the size of the subsets of near-optimal actions and is never larger than the covering dimension used in previous analyses. As such, our results are the first provably…
Regret Bounds for Adaptive Nonlinear Control
The first finite-time regret bounds for adaptive nonlinear control with matched uncertainty in the stochastic setting are proved, showing that the regret suffered by certainty equivalence adaptive control, compared to an oracle controller with perfect knowledge of the unmodeled disturbances, is upper bounded by $\widetilde{O}(\sqrt{T})$ in expectation.
Adaptive Discretization for Model-Based Reinforcement Learning
This work introduces the technique of adaptive discretization to design efficient model-based episodic reinforcement learning algorithms in large (potentially continuous) state-action spaces and provides worst-case regret bounds for this algorithm that are competitive with those of state-of-the-art model-based algorithms.


Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces
This work presents an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces, based on a novel Q-learning policy with adaptive data-driven discretization, which recovers the regret guarantees of prior algorithms for continuous state-action spaces.
Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
This paper proposes ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in on more promising and frequently visited regions while carefully balancing the exploration-exploitation trade-off.
Learning to Control in Metric Space with Optimal Regret
This work provides a surprisingly simple upper-confidence reinforcement learning algorithm that uses a function approximation oracle to estimate optimistic Q functions from experiences and establishes a near-matching regret lower bound.
Exploration in Metric State Spaces
We present metric-E3, a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of…
Efficient Model-free Reinforcement Learning in Metric Spaces
This work presents an efficient model-free, Q-learning-based algorithm for MDPs with a natural metric on the state-action space that does not require access to a black-box planning oracle.
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
A new framework for theoretically measuring the performance of reinforcement learning algorithms, called Uniform-PAC, is introduced. It strengthens the classical Probably Approximately Correct (PAC) framework and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
Is Q-learning Provably Efficient?
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…
Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting
Two qualitatively different regret bounds are obtained: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions.
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty and derives sublinear regret bounds for undiscounted reinforcement learning in continuous state space.
Adaptive aggregation for reinforcement learning in average reward Markov decision processes
  • R. Ortner
  • Mathematics, Computer Science
  • Ann. Oper. Res.
  • 2013
This work presents an algorithm that aggregates states online while learning to behave optimally in an average reward Markov decision process, and derives bounds on the regret this algorithm suffers with respect to an optimal policy.