Corpus ID: 380221

High-Confidence Off-Policy Evaluation

@inproceedings{Thomas2015HighConfidenceOE,
  title={High-Confidence Off-Policy Evaluation},
  author={Philip S. Thomas and Georgios Theocharous and Mohammad Ghavamzadeh},
  booktitle={AAAI},
  year={2015}
}
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not…
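To make the setting concrete, here is a minimal Python sketch of the kind of estimator high-confidence off-policy evaluation builds on: per-trajectory importance sampling combined with a simple Hoeffding-style lower confidence bound. The function names, the (state, action, reward) trajectory format, and the use of Hoeffding's inequality are illustrative assumptions, not the paper's exact method; the paper derives considerably tighter concentration bounds, precisely because importance-weighted returns can have an enormous range b.

import numpy as np

def importance_weighted_returns(trajectories, pi_e, pi_b, gamma=1.0):
    # Per-trajectory importance-sampled returns for evaluating pi_e from
    # data generated by pi_b. Each trajectory is a list of (s, a, r)
    # tuples; pi_e(s, a) and pi_b(s, a) return action probabilities.
    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for (s, a, r) in traj:
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += discount * r
            discount *= gamma
        estimates.append(weight * ret)
    return np.array(estimates)

def hoeffding_lower_bound(x, delta, b):
    # 1 - delta lower confidence bound on E[x] for samples in [0, b].
    # Illustrative only: for importance-weighted returns b is typically
    # huge, which is exactly why the paper develops tighter bounds.
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

For the Hoeffding bound to hold, the weighted returns must actually lie in [0, b] (e.g., by capping the importance weights), a detail this sketch glosses over.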

Citations

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
A new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way of mixing model-based and importance sampling estimates (a toy sketch of such mixing follows below).
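As a toy illustration only (not the cited paper's estimator, which selects its mixing weights by minimizing an estimated mean squared error), a fixed-weight blend of the two kinds of estimate looks like this:

import numpy as np

def blended_estimate(is_estimates, model_estimate, w):
    # Blend a low-bias/high-variance importance sampling estimate with a
    # low-variance/possibly-biased model-based estimate. The weight w is
    # assumed given here; the cited paper chooses it from data.
    return w * np.mean(is_estimates) + (1.0 - w) * model_estimate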
High Confidence Off-Policy Evaluation with Models
This work proposes two bootstrapping approaches, combined with learned MDP transition models, to efficiently estimate lower confidence bounds on policy performance from limited data in both continuous and discrete state spaces, and derives a theoretical upper bound on the model bias (a simplified bootstrap sketch follows below).
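A hedged sketch of the bootstrapping idea: resample the per-trajectory estimates with replacement and take the delta-quantile of the resampled means as the lower bound. This simplified version omits refitting the MDP transition model inside each resample, which is what the cited work actually does; all names here are assumptions.

import numpy as np

def bootstrap_lower_bound(per_traj_estimates, delta, n_boot=2000, seed=0):
    # Percentile-bootstrap (1 - delta) lower confidence bound on the
    # mean of per-trajectory off-policy estimates.
    rng = np.random.default_rng(seed)
    x = np.asarray(per_traj_estimates, dtype=float)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(n_boot)]
    return float(np.quantile(means, delta))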
On- and Off-Policy Monotonic Policy Improvement
This paper shows that monotonic policy improvement is guaranteed from a mixture of on- and off-policy samples, and provides a trust-region policy optimization method using experience replay as a naive application of the proposed bound.
Marginalized Off-Policy Evaluation for Reinforcement Learning
This paper proposes a new off-policy evaluation approach based directly on discrete directed acyclic graph (DAG) MDPs; it can be applied to most off-policy evaluation estimators without modification and can reduce variance dramatically (a sketch of the marginalized weighting follows below).
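To illustrate the variance argument: ordinary importance sampling multiplies per-step ratios across a whole trajectory, while a marginalized estimator weights each transition by a state-distribution ratio. The sketch below assumes the ratio d_pi_e(s)/d_pi_b(s) is handed in as state_ratio; estimating that ratio is the hard part, and this is not the cited paper's exact DAG-based estimator.

import numpy as np

def marginalized_is(transitions, state_ratio, pi_e, pi_b):
    # Weight each (s, a, r) sample by d_pi_e(s)/d_pi_b(s) * pi_e(a|s)/pi_b(a|s)
    # rather than by a product of per-step ratios over the whole trajectory.
    weights = np.array([state_ratio(s) * pi_e(s, a) / pi_b(s, a)
                        for (s, a, r) in transitions])
    rewards = np.array([r for (_, _, r) in transitions])
    return float(np.mean(weights * rewards))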
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require…
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
This work extends the doubly robust estimator for bandits to sequential decision-making problems, getting the best of both worlds: the estimator is guaranteed to be unbiased and can have much lower variance than the popular importance sampling estimators (a sketch of the recursion follows below).
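A compact sketch of the doubly robust recursion this entry refers to, in the backward form DR = v_hat(s) + rho * (r + gamma * DR' - q_hat(s, a)). The trajectory format and the approximate model functions q_hat and v_hat are assumptions; the point is that the estimate stays unbiased even when the model is wrong, while the model terms reduce variance.

import numpy as np

def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    # Doubly robust off-policy value estimate. q_hat(s, a) and v_hat(s)
    # come from an approximate model of pi_e's action-value function.
    estimates = []
    for traj in trajectories:
        dr = 0.0
        for (s, a, r) in reversed(traj):
            rho = pi_e(s, a) / pi_b(s, a)
            dr = v_hat(s) + rho * (r + gamma * dr - q_hat(s, a))
        estimates.append(dr)
    return float(np.mean(estimates))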
A Configurable off-Policy Evaluation with Key State-Based Bias Constraints in AI Reinforcement Learning
This paper develops a configurable OPE method with key-state-based bias constraints: FP-Growth is adopted to mine key states and obtain their corresponding reward expectations, and each reward-expectation range is configured as a bias constraint to construct a new objective function combining bias and variance.
Off-policy Model-based Learning under Unknown Factored Dynamics
The G-SCOPE algorithm is introduced, which evaluates a new policy based on data generated by the existing policy and is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
Two bootstrapping off-policy evaluation methods are proposed that use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces.
Accountable Off-Policy Evaluation With Kernel Bellman Statistics
A new variational framework is proposed that reduces the problem of calculating tight confidence bounds in OPE to an optimization problem over a feasible set that captures the true state-action value function with high probability.

References

Showing 1–10 of 24 references
Eligibility Traces for Off-Policy Policy Evaluation
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility-trace algorithm was previously known (a Monte Carlo method), and analyzes and compares it with four new eligibility-trace algorithms, emphasizing their relationships to the classical statistical technique of importance sampling (a per-decision sketch follows below).
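For contrast with whole-trajectory weighting, here is a brief per-decision importance sampling sketch in the spirit of this reference: each reward is weighted only by the action-probability ratios of the steps up to and including it, which lowers variance. The data format and names are assumptions, not the paper's notation.

import numpy as np

def per_decision_is(trajectories, pi_e, pi_b, gamma=1.0):
    # Per-decision importance sampling: reward r_t is weighted by the
    # product of ratios for steps 0..t only, not the whole trajectory.
    estimates = []
    for traj in trajectories:
        weight, total, discount = 1.0, 0.0, 1.0
        for (s, a, r) in traj:
            weight *= pi_e(s, a) / pi_b(s, a)
            total += discount * weight * r
            discount *= gamma
        estimates.append(total)
    return float(np.mean(estimates))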
Regularized Off-Policy TD-Learning
A novel l1-regularized convergent off-policy TD-learning method is presented, which is able to learn sparse representations of value functions with low computational complexity.
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Introduction to Reinforcement Learning
In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.
Offline policy evaluation across representations with applications to educational games
A data-driven methodology for comparing and validating policies offline, which focuses on each policy's ability to generalize to new data and is applied to a partially observable, high-dimensional concept-sequencing problem in an educational game.
Planning treatment of ischemic heart disease with partially observable Markov decision processes
This paper shows how the POMDP framework can be used to model and solve the problem of managing patients with ischemic heart disease (IHD), and demonstrates the framework's modeling advantages over standard decision formalisms.
Concurrent Reinforcement Learning from Customer Interactions
This paper presents the first framework for concurrent reinforcement learning, using a variant of temporal-difference learning to learn efficiently from partial interaction sequences in applications in which a company interacts concurrently with many customers.
GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces
A new family of gradient temporal-difference learning algorithms has recently been introduced by Sutton, Maei, and others, in which function approximation is much more straightforward. In this paper, …
Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning
To the authors' knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis.
Application of the Actor-Critic Architecture to Functional Electrical Stimulation Control of a Human Arm
This paper studies the application of the actor-critic architecture, with neural networks for both the actor and the critic, as a controller that can adapt to the changing dynamics of a human arm.