Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

@article{Wen2017EfficientRL,
  title={Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization},
  author={Zheng Wen and Benjamin Van Roy},
  journal={Math. Oper. Res.},
  year={2017},
  volume={42},
  pages={762-782}
}
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and, as a solution, propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within a given hypothesis class, OCP selects optimal actions over all but at most K episodes, where K is the eluder dimension of the given hypothesis class. We establish further…
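The abstract describes OCP only at a high level. The following is a minimal, illustrative sketch (not the paper's pseudocode) of the realizable case with a finite hypothesis class: act greedily with respect to the pointwise-optimistic Q-function over the surviving hypotheses, then discard any hypothesis that violates a Bellman equality observed along the episode. The environment interface (`reset`, `step`, `actions`) and the dict-based Q-function representation are assumptions made purely for illustration.

```python
def ocp_episode(env, hypotheses, horizon):
    """One episode of a simplified optimistic-constraint-propagation scheme.

    Illustrative assumptions (not the paper's pseudocode):
      - `hypotheses` is a finite list of candidate Q-functions, each a dict
        mapping (step, state, action) -> value, assumed to contain the true
        optimal Q-function (the realizable / "coherent" case).
      - `env` is a deterministic episodic environment with `reset()`,
        `step(action)` returning (next_state, reward), and an `actions`
        attribute listing the available actions.
    Returns the subset of hypotheses consistent with this episode's observations.
    """
    state = env.reset()
    transitions = []
    for h in range(horizon):
        # Optimism: pick the action with the largest Q-value that any
        # surviving hypothesis assigns to (h, state, action).
        action = max(
            env.actions,
            key=lambda a: max(q[(h, state, a)] for q in hypotheses),
        )
        next_state, reward = env.step(action)
        transitions.append((h, state, action, reward, next_state))
        state = next_state

    # Constraint propagation: in a deterministic system each observed
    # transition yields the Bellman equality
    #   Q(h, s, a) = r + max_{a'} Q(h + 1, s', a')   (value 0 beyond the horizon),
    # so hypotheses violating any observed equality are discarded.
    def consistent(q):
        for (h, s, a, r, s2) in transitions:
            future = 0.0 if h + 1 == horizon else max(
                q[(h + 1, s2, a2)] for a2 in env.actions
            )
            if abs(q[(h, s, a)] - (r + future)) > 1e-9:
                return False
        return True

    return [q for q in hypotheses if consistent(q)]
```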

Citations

On the Sample Complexity of Reinforcement Learning with Policy Space Generalization
TLDR
A new notion of eluder dimension for the policy space is proposed, which characterizes the intrinsic complexity of policy learning in an arbitrary Markov decision process (MDP), and a near-optimal sample complexity upper bound is proved that depends only linearly on the eluder dimension.
Learning to Control in Metric Space with Optimal Regret
TLDR
This work provides a surprisingly simple upper-confidence reinforcement learning algorithm that uses a function approximation oracle to estimate optimistic Q functions from experiences and establishes a near-matching regret lower bound.
On Oracle-Efficient PAC Reinforcement Learning with Rich Observations
TLDR
It is proved that the only known sample-efficient algorithm, Olive, cannot be implemented in the oracle model, and new sample-efficient algorithms are presented for environments with deterministic hidden state dynamics and stochastic rich observations.
Provably Efficient Reinforcement Learning with Linear Function Approximation
TLDR
This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves a regret bound that depends polynomially on d, the ambient dimension of the feature space, and H, the length of each episode, grows only sublinearly in T, the total number of steps, and is independent of the number of states and actions.
Provably Efficient Reinforcement Learning with Linear Function Approximation
TLDR
This paper proves that an optimistic modification of Least-Squares Value Iteration, a classical algorithm frequently studied in the linear setting, achieves a regret bound that is independent of the number of states and actions, without requiring a “simulator” or additional assumptions. (A rough sketch of such an optimistic least-squares update appears after this list of citing papers.)
Provably Efficient Reinforcement Learning with Aggregated States
TLDR
This work establishes a regret bound for an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation; this is the first such result for reinforcement learning with nontrivial value function approximation that places no restrictions on the transition probabilities.
On Polynomial Time PAC Reinforcement Learning with Rich Observations
TLDR
It is shown that the only known statistically efficient algorithm for the more general stochastic transition setting requires NP-hard computation which cannot be implemented via standard optimization primitives.
Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting
TLDR
A new sampling protocol is investigated, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states; under this protocol, an algorithm is developed whose sample complexity scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not with the size of the state/action space.
Provably Efficient Exploration in Policy Optimization
TLDR
This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, the proposed OPPO algorithm achieves a sublinear regret bound.
...
...
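The two citing papers above on linear function approximation describe an optimistic modification of least-squares value iteration. Below is a rough, self-contained sketch of one backward pass of such an update; the feature map `phi`, the bonus coefficient `beta`, the data layout, and the omission of value clipping to [0, H] are simplifications for illustration, not the authors' exact algorithm.

```python
import numpy as np

def lsvi_ucb_update(data, phi, n_actions, horizon, beta, lam=1.0):
    """One backward pass of optimistic least-squares value iteration (illustrative sketch).

    data[h] is a list of (state, action, reward, next_state) tuples collected
    at step h; phi(s, a) returns a d-dimensional feature vector. `phi`, `beta`,
    and the data layout are assumptions for illustration. Returns a list of
    per-step Q-value functions q[h](s, a).
    """
    d = len(phi(data[0][0][0], 0))
    q = [lambda s, a: 0.0] * (horizon + 1)  # q[horizon] is identically zero
    for h in range(horizon - 1, -1, -1):
        Lambda = lam * np.eye(d)
        target_sum = np.zeros(d)
        for (s, a, r, s2) in data[h]:
            x = phi(s, a)
            Lambda += np.outer(x, x)
            # Regression target: observed reward plus the optimistic value at
            # the next step, taken greedily over actions.
            v_next = max(q[h + 1](s2, a2) for a2 in range(n_actions))
            target_sum += x * (r + v_next)
        Lambda_inv = np.linalg.inv(Lambda)
        w = Lambda_inv @ target_sum

        def q_h(s, a, w=w, Lambda_inv=Lambda_inv):
            x = phi(s, a)
            # Ridge-regression estimate plus an exploration bonus proportional
            # to the feature's uncertainty under the empirical covariance.
            bonus = beta * np.sqrt(x @ Lambda_inv @ x)
            return float(x @ w + bonus)

        q[h] = q_h
    return q
```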

References

SHOWING 1-10 OF 47 REFERENCES
Efficient Exploration and Value Function Generalization in Deterministic Systems
TLDR
Optimistic constraint propagation (OCP) is proposed, an algorithm designed to synthesize efficient exploration and value function generalization that selects optimal actions over all but at most $\dim_E[\mathcal{Q}]$ episodes, where $\dim_E$ denotes the eluder dimension.
Near-Optimal Reinforcement Learning in Polynomial Time
TLDR
New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
(More) Efficient Reinforcement Learning via Posterior Sampling
TLDR
An $\widetilde O(\tau S\sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. (A minimal tabular sketch of the posterior-sampling loop appears at the end of this reference list.)
Probably Approximately Correct (PAC) Exploration in Reinforcement Learning
TLDR
A theorem is presented that provides sufficient conditions for an algorithm to be PAC-MDP, or Probably Approximately Correct (PAC) in RL, and it is shown how these conditions can be applied to prove that efficient learning is possible in three interesting scenarios: finite MDPs (i.e., the "tabular" case), factored MDPs, and continuous MDPs with linear dynamics.
Efficient Reinforcement Learning in Factored MDPs
We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN).
PAC model-free reinforcement learning
TLDR
This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Generalization and Exploration via Randomized Value Functions
TLDR
The results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.
Regret Bounds for Reinforcement Learning with Policy Advice
TLDR
It is proved that RLPA has a sub-linear regret of $\widetilde O(\sqrt{T})$ relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action space.
On the sample complexity of reinforcement learning.
TLDR
Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.
...
...
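The posterior-sampling reference above states a regret guarantee but not the mechanics. Below is a minimal tabular sketch of the generic posterior-sampling loop (sample an MDP from the posterior, solve it by dynamic programming, act with its optimal policy, update the posterior); the Dirichlet prior, the crude mean-reward estimate, and the environment interface are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def psrl_episode(env, counts, reward_sums, n_states, n_actions, horizon):
    """One episode of posterior sampling for RL (tabular sketch, not the paper's code).

    counts[s, a, s'] are Dirichlet pseudo-counts for transitions, and
    reward_sums[s, a] / max(counts[s, a].sum(), 1) is a crude mean-reward
    estimate; a full implementation would also keep a posterior over rewards.
    """
    # Sample a transition model from the posterior (one Dirichlet per (s, a)).
    p = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            p[s, a] = np.random.dirichlet(counts[s, a] + 1.0)
    r_hat = reward_sums / np.maximum(counts.sum(axis=2), 1.0)

    # Solve the sampled MDP by finite-horizon dynamic programming.
    v = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in range(horizon - 1, -1, -1):
        q = r_hat + p @ v            # shape (n_states, n_actions)
        policy[h] = q.argmax(axis=1)
        v = q.max(axis=1)

    # Act with the sampled MDP's optimal policy and update the posterior counts.
    s = env.reset()
    for h in range(horizon):
        a = policy[h, s]
        s2, rew = env.step(a)
        counts[s, a, s2] += 1.0
        reward_sums[s, a] += rew
        s = s2
```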