# Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

@article{Wen2017EfficientRL,
title={Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization},
author={Zheng Wen and Benjamin Van Roy},
journal={Math. Oper. Res.},
year={2017},
volume={42},
pages={762--782}
}
• Published 18 July 2013
• Mathematics
• Math. Oper. Res.
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within a given hypothesis class, OCP selects optimal actions over all but at most K episodes, where K is the eluder dimension of the given hypothesis class. We establish further…
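The abstract's idea can be illustrated with a small sketch. The following is a minimal tabular instantiation of the optimistic-constraint-propagation idea, not the paper's general algorithm: Q-values start at the maximum achievable return-to-go, the agent acts greedily with respect to them, and each observed transition of the deterministic system propagates an exact constraint. The chain MDP, horizon, and rewards below are illustrative assumptions.

```python
# Tabular sketch of optimistic constraint propagation (OCP) for a
# finite-horizon deterministic system. The toy chain MDP, horizon, and
# reward values are illustrative assumptions, not from the paper.
H, S, A = 3, 4, 2  # horizon, number of states, number of actions

def step(s, a):
    """Deterministic dynamics: action 1 moves right, action 0 stays.
    Reward 1 only for moving right from the next-to-last state."""
    s_next = min(s + a, S - 1)
    r = 1.0 if (a == 1 and s == S - 2) else 0.0
    return s_next, r

# Optimistic Q-values, initialized to the maximum achievable return-to-go.
Q = [[[float(H - h)] * A for _ in range(S)] for h in range(H)]

def greedy(h, s):
    return max(range(A), key=lambda a: Q[h][s][a])

for episode in range(20):
    s = 0
    for h in range(H):
        a = greedy(h, s)                       # act greedily w.r.t. optimistic Q
        s_next, r = step(s, a)
        v_next = max(Q[h + 1][s_next]) if h + 1 < H else 0.0
        Q[h][s][a] = r + v_next                # propagate the observed constraint
        s = s_next

# After a few episodes the greedy policy from state 0 is optimal.
total, s = 0.0, 0
for h in range(H):
    a = greedy(h, s)
    s, r = step(s, a)
    total += r
print(total)  # optimal return on this chain is 1.0
```

Because the system is deterministic, each observed constraint pins a Q-value down exactly, which is what drives the episode-count bound in terms of the hypothesis class's eluder dimension.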

## Citations

On the Sample Complexity of Reinforcement Learning with Policy Space Generalization
• Computer Science, Mathematics
ArXiv
• 2020
A new notion of eluder dimension for the policy space is proposed, which characterizes the intrinsic complexity of policy learning in an arbitrary Markov decision process (MDP), and a near-optimal sample complexity upper bound is proved that depends only linearly on the eluder dimension.
Learning to Control in Metric Space with Optimal Regret
• Computer Science
2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
• 2019
This work provides a surprisingly simple upper-confidence reinforcement learning algorithm that uses a function approximation oracle to estimate optimistic Q functions from experiences and establishes a near-matching regret lower bound.
On Oracle-Efficient PAC Reinforcement Learning with Rich Observations
• Computer Science
• 2018
It is proved that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model, and new sample-efficient algorithms are presented for environments with deterministic hidden-state dynamics and stochastic rich observations.
Provably Efficient Reinforcement Learning with Linear Function Approximation
• Computer Science
COLT
• 2020
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\widetilde O(\sqrt{d^3H^3T})$ regret, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
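The core mechanism behind the entry above can be sketched in a few lines: fit the Q-function by ridge regression on observed features and targets, then add an elliptical confidence bonus. The random features, regularizer, and bonus scale below are illustrative assumptions, not the paper's tuned constants.

```python
# Minimal sketch of the optimistic least-squares value-iteration idea
# (ridge regression plus a UCB bonus), under illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, beta = 4, 50, 1.0, 0.1      # feature dim, samples, ridge, bonus scale

Phi = rng.normal(size=(n, d))           # features of visited (s, a) pairs
targets = Phi @ np.ones(d) + 0.01 * rng.normal(size=n)  # samples of r + V_next

# Ridge regression: w = (Phi^T Phi + lam I)^{-1} Phi^T y
Lambda = Phi.T @ Phi + lam * np.eye(d)
w = np.linalg.solve(Lambda, Phi.T @ targets)

def optimistic_q(phi):
    """Point estimate plus an elliptical confidence bonus beta * ||phi||_{Lambda^-1}."""
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return phi @ w + bonus

phi_new = rng.normal(size=d)
print(optimistic_q(phi_new) >= phi_new @ w)  # bonus is nonnegative, so True
```

The bonus shrinks in well-explored feature directions and stays large in unexplored ones, which is what lets the regret depend on d rather than on the number of states and actions.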
Provably Efficient Reinforcement Learning with Aggregated States
• Computer Science
ArXiv
• 2019
This work establishes a regret bound for an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation, the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.
On Polynomial Time PAC Reinforcement Learning with Rich Observations
• Computer Science
ArXiv
• 2018
It is shown that the only known statistically efficient algorithm for the more general stochastic-transition setting requires NP-hard computation that cannot be implemented via standard optimization primitives.
Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
A new sampling protocol is investigated, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states, and an algorithm is developed that achieves a sample complexity that scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not with the size of the state/action space.
Provably Efficient Exploration in Policy Optimization
• Computer Science
ICML
• 2020
This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves sublinear regret.

## References

Showing 1-10 of 47 references
Efficient Exploration and Value Function Generalization in Deterministic Systems
• Mathematics
NIPS
• 2013
Optimistic constraint propagation (OCP) is proposed, an algorithm designed to synthesize efficient exploration and value function generalization that selects optimal actions over all but at most $\dim_E[\mathcal{Q}]$ episodes, where $\dim_E$ denotes the eluder dimension.
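The eluder dimension that appears in the bound above can be computed by brute force for a tiny finite hypothesis class: it is the length of the longest sequence of inputs in which every element is eps-independent of its predecessors. The three toy hypotheses, the three-point domain, and eps below are illustrative assumptions chosen so the computation is exact.

```python
# Brute-force illustration of the eluder dimension for a tiny finite
# hypothesis class F on a small input set X (toy assumptions throughout).
from itertools import permutations

X = [0, 1, 2]
F = [  # each hypothesis maps inputs in X to real values
    {0: 0.0, 1: 0.0, 2: 0.0},
    {0: 1.0, 1: 0.0, 2: 0.0},
    {0: 0.0, 1: 1.0, 2: 1.0},
]
eps = 0.5

def independent(x, prefix):
    """x is eps-independent of `prefix` if some pair of hypotheses agrees
    on the prefix (within eps in l2) yet disagrees by more than eps at x."""
    for f in F:
        for g in F:
            gap = sum((f[z] - g[z]) ** 2 for z in prefix) ** 0.5
            if gap <= eps and abs(f[x] - g[x]) > eps:
                return True
    return False

def eluder_dim():
    """Longest sequence in which every element is eps-independent of its
    predecessors; on this tiny domain, trying every ordering suffices."""
    best = 0
    for seq in permutations(X):
        k = 0
        for i, x in enumerate(seq):
            if not independent(x, seq[:i]):
                break
            k = i + 1
        best = max(best, k)
    return best

print(eluder_dim())  # 2 for this toy class
```

Intuitively, the eluder dimension counts how long an adversary can keep "surprising" the learner: once every input is eps-dependent on what has been seen, the hypothesis class can no longer hide a large error.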
Near-Optimal Reinforcement Learning in Polynomial Time
• Computer Science
Machine Learning
• 2004
New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
(More) Efficient Reinforcement Learning via Posterior Sampling
• Computer Science
NIPS
• 2013
An $\widetilde O(\tau S\sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
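The posterior-sampling principle behind the entry above is easy to demonstrate in its simplest special case: with horizon 1 and Bernoulli rewards, sampling an MDP from the posterior and acting optimally in it reduces to Thompson sampling. The true reward probabilities and Beta priors below are illustrative assumptions.

```python
# One-step illustration of posterior sampling for RL (PSRL): with horizon 1
# and Bernoulli rewards it reduces to Thompson sampling over two actions.
import random

random.seed(0)
p_true = [0.2, 0.8]                    # unknown mean rewards (toy assumption)
alpha = [1, 1]
beta = [1, 1]                          # Beta(1, 1) priors on each action

pulls = [0, 0]
for t in range(2000):
    # Sample one "MDP" (here, one mean per action) from the posterior...
    sampled = [random.betavariate(alpha[a], beta[a]) for a in range(2)]
    a = max(range(2), key=lambda i: sampled[i])   # ...and act optimally in it
    r = 1 if random.random() < p_true[a] else 0
    alpha[a] += r                                  # conjugate posterior update
    beta[a] += 1 - r
    pulls[a] += 1

print(pulls[1] > pulls[0])  # the better action dominates: True
```

Exploration here is driven entirely by posterior uncertainty rather than by an explicit optimism bonus, which is the distinction the cited abstract emphasizes.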
Probably Approximately Correct (PAC) Exploration in Reinforcement Learning
A theorem is presented that provides sufficient conditions for an algorithm to be PAC-MDP, or Probably Approximately Correct in RL, and it is shown how these conditions can be applied to prove that efficient learning is possible in three interesting scenarios: finite MDPs (i.e., the "tabular" case), factored MDPs, and continuous MDPs with linear dynamics.
Efficient Reinforcement Learning in Factored MDPs
• Computer Science
IJCAI
• 1999
We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN).
PAC model-free reinforcement learning
• Computer Science
ICML
• 2006
This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Generalization and Exploration via Randomized Value Functions
• Computer Science
ICML
• 2016
The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.
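The randomized-value-function idea summarized above can be sketched in the linear setting: instead of adding an optimism bonus, draw the value-function weights from a Gaussian centred at the ridge-regression solution, so each episode's greedy policy is randomized. The dimensions, noise scale, and synthetic regression data below are illustrative assumptions, not the cited paper's algorithm in full.

```python
# Sketch of exploration via randomized value functions (RLSVI-style):
# sample the weights of a linear Q-function from a Gaussian posterior
# around the ridge-regression fit. All constants are toy assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, n, lam, sigma = 4, 60, 1.0, 0.5     # feature dim, samples, ridge, noise scale

Phi = rng.normal(size=(n, d))
y = Phi @ np.ones(d) + 0.1 * rng.normal(size=n)   # regression targets r + V_next

Lambda = Phi.T @ Phi + lam * np.eye(d)
w_mean = np.linalg.solve(Lambda, Phi.T @ y)        # ridge-regression solution
cov = sigma ** 2 * np.linalg.inv(Lambda)           # posterior covariance

# Each episode uses a fresh weight sample, which randomizes the greedy
# policy and thereby drives exploration without an explicit bonus term.
w_sample = rng.multivariate_normal(w_mean, cov)
print(w_sample.shape)  # (4,)
```

As with posterior sampling, exploration intensity here adapts automatically: the covariance is large in feature directions the data has not yet pinned down.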
Regret Bounds for Reinforcement Learning with Policy Advice
• Computer Science
ECML/PKDD
• 2013
It is proved that RLPA has a sub-linear regret of $\widetilde O(\sqrt{T})$ relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action space.
On the sample complexity of reinforcement learning.
Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.