Bounds on Sample Size for Policy Evaluation in Markov Environments

  • Leonid Peshkin, Sayan Mukherjee
Reinforcement learning means finding the optimal course of action in Markovian environments without knowledge of the environment's dynamics. Stochastic optimization algorithms used in the field rely on estimates of the value of a policy. Typically, the value of a policy is estimated from results of simulating that very policy in the environment. This approach requires a large amount of simulation as different points in the policy space are considered. In this paper, we develop value estimators… 

Learning from Scarce Experience

Presents a family of algorithms based on likelihood ratio estimation that use data gathered while executing one policy (or a collection of policies) to estimate the value of a different policy, reports positive empirical results, and provides a sample complexity bound.
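The likelihood ratio idea summarized above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes memoryless policies given as functions from an observation to a list of action probabilities, trajectories stored as lists of (observation, action, reward) tuples, and an undiscounted return. All function names are hypothetical.

```python
def trajectory_likelihood_ratio(trajectory, target_policy, behavior_policy):
    """Product over steps of P_target(action | obs) / P_behavior(action | obs).

    Environment transition probabilities cancel in the ratio, so only the
    two policies are needed (assumption: both policies are known).
    """
    ratio = 1.0
    for obs, action, _reward in trajectory:
        ratio *= target_policy(obs)[action] / behavior_policy(obs)[action]
    return ratio


def importance_sampling_value(trajectories, target_policy, behavior_policy):
    """Estimate the target policy's value from behavior-policy trajectories."""
    total = 0.0
    for trajectory in trajectories:
        # Undiscounted return of this trajectory (an assumption of the sketch).
        trajectory_return = sum(reward for _obs, _action, reward in trajectory)
        weight = trajectory_likelihood_ratio(
            trajectory, target_policy, behavior_policy
        )
        total += weight * trajectory_return
    return total / len(trajectories)


# Toy one-step problem: a single observation 0, action 1 pays reward 1.
def uniform_policy(obs):
    return [0.5, 0.5]


def greedy_policy(obs):
    return [0.1, 0.9]


# Two hand-written trajectories, as if sampled from uniform_policy.
sampled = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
estimate = importance_sampling_value(sampled, greedy_policy, uniform_policy)
print(estimate)
```

The estimator is only defined when the behavior policy gives nonzero probability to every action the target policy might take, and its variance can grow quickly with trajectory length.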

PAC bounds for simulation-based optimization of Markov decision processes

  • T. Watson
  • Mathematics, Computer Science
    2007 46th IEEE Conference on Decision and Control
  • 2007
An estimate is obtained for the value function of a Markov decision process, which assigns to each policy its expected discounted reward, and a framework to obtain an ε-optimal policy from simulation is proposed.

Simulation-based Uniform Value Function Estimates of Markov Decision Processes

Bounds on the number of runs needed for the uniform convergence of the empirical average to the expected reward for a class of policies are derived, in terms of the Vapnik-Chervonenkis or P-dimension of the policy class.
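To give a feel for this kind of bound, here is a sketch of a much simpler stand-in: Hoeffding's inequality with a union bound over a finite policy class. It is not the bound derived in the paper above, which handles infinite classes through the VC or P-dimension. The function name and parameters are hypothetical, and the sketch assumes each run's return lies in an interval of known width.

```python
import math


def hoeffding_runs(epsilon, delta, return_range, num_policies=1):
    """Number of independent runs per policy so that, with probability at
    least 1 - delta, every empirical average is within epsilon of its
    expected return, for a finite class of num_policies policies.

    Uses n >= return_range^2 * ln(2 * num_policies / delta) / (2 * epsilon^2).
    """
    runs = (return_range ** 2) * math.log(2 * num_policies / delta)
    runs /= 2 * epsilon ** 2
    return math.ceil(runs)


single = hoeffding_runs(epsilon=0.1, delta=0.05, return_range=1.0)
whole_class = hoeffding_runs(
    epsilon=0.1, delta=0.05, return_range=1.0, num_policies=1000
)
print(single, whole_class)
```

The required number of runs grows only logarithmically with the size of the class but quadratically as epsilon shrinks.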

Policy Improvement for POMDPs Using Normalized Importance Sampling

A new method for estimating the expected return of a POMDP from experience, motivated from both function-approximation and importance-sampling points of view. The estimator has low variance, and its bias is often irrelevant when it is used for pairwise comparisons.
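A minimal sketch of the general normalized (self-normalized) importance sampling estimator follows. It is not necessarily the exact estimator of the paper above. It assumes the per-trajectory likelihood ratios and returns have already been computed, and the function names are hypothetical.

```python
def ordinary_is_estimate(weights, returns):
    """Unbiased estimate: average of weight * return."""
    return sum(w * r for w, r in zip(weights, returns)) / len(returns)


def normalized_is_estimate(weights, returns):
    """Biased but usually lower-variance estimate: weighted returns divided
    by the sum of the weights rather than by the sample count."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all importance weights are zero")
    return sum(w * r for w, r in zip(weights, returns)) / total_weight


# Hypothetical likelihood ratios and returns for three trajectories.
weights = [1.8, 0.2, 0.2]
returns = [1.0, 0.0, 0.0]
print(ordinary_is_estimate(weights, returns))
print(normalized_is_estimate(weights, returns))
```

Because the normalized estimate is a weighted average of the observed returns, it always lies between the smallest and largest return, which the ordinary estimate does not guarantee.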

Balanced Importance Sampling Estimation

  • A. Pacut
  • Mathematics, Computer Science
  • 2006
This paper introduces the family of balanced importance sampling estimators, proves their consistency, and demonstrates empirically their superiority over their classical counterparts.

Importance sampling for reinforcement learning with multiple objectives

This thesis considers three complications that arise from applying reinforcement learning to a real-world application, and employs importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with little data.

Representation of a State Value Function No Yes Representation No Simple DP Value Function of a Policy Yes Policy Search Hybrid

A survey of policy search algorithms in reinforcement learning is presented, examining practical applications, future trends and other issues that pertain to current day policy search techniques.

Integrated Common Sense Learning and Planning in POMDPs

The results essentially establish that the possession of a suitable exploration policy for collecting the necessary examples is the fundamental obstacle to learning to act in such environments.



Exploration in Gradient-Based Reinforcement Learning

This paper provides a method that uses importance sampling to allow any well-behaved directed exploration policy during learning, and shows both theoretically and experimentally that this method can achieve dramatic performance improvements.

Reinforcement learning and mistake bounded algorithms

This work explores an interesting connection between mistake bounded learning algorithms and computing a near-best strategy, from a restricted class of strategies, for a given POMDP.

Approximate Planning in Large POMDPs via Reusable Trajectories

Upper bounds on the sample complexity are proved, showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π and exponentially on the horizon time.

Eligibility Traces for Off-Policy Policy Evaluation

This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.

Learning to Cooperate via Policy Search

This paper provides a gradient-based distributed policy-search method for cooperative games and compares the notion of local optimum to that of Nash equilibrium, and demonstrates the effectiveness of this method experimentally in a small, partially observable simulated soccer domain.

The Complexity of Markov Decision Processes

All three variants of the classical problem of optimal policy computation in Markov decision processes (finite horizon, infinite horizon discounted, and infinite horizon average cost) are shown to be complete for P, and therefore most likely cannot be solved by highly parallel algorithms.

Learning Policies with External Memory

This paper explores a stigmergic approach, in which the agent's actions include the ability to set and clear bits in an external memory, and the external memory is included as part of the input to the agent.

Learning Finite-State Controllers for Partially Observable Environments

Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finite-state controller. The paper also shows that the algorithm can extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time step.

Adaptive Importance Sampling for Estimation in Structured Domains

This paper presents a stochastic-gradient-descent method for sequentially updating the sampling distribution by directly minimizing the variance, and presents other stochastic-gradient-descent methods based on minimizing typical notions of distance between the current sampling distribution and approximations of the target, optimal distribution.

Markov Decision Processes: Discrete Stochastic Dynamic Programming

  • M. Puterman
  • Computer Science
    Wiley Series in Probability and Statistics
  • 1994
Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.