Corpus ID: 244715202

Robust On-Policy Data Collection for Data-Efficient Policy Evaluation

Rujie Zhong, Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that… 
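The policy-evaluation task described above — estimating an evaluation policy's expected return from sampled interaction — can be sketched with a simple Monte Carlo estimator. This is a minimal illustration, not code from the paper; the environment and all names are hypothetical.

```python
import random

# Minimal sketch (not from the paper): Monte Carlo policy evaluation
# estimates a policy's expected return by averaging sampled episode returns.
def monte_carlo_return(policy, sample_episode, n_episodes, gamma=1.0):
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(policy)
        g = 0.0
        for r in reversed(rewards):  # discounted return, accumulated backwards
            g = r + gamma * g
        total += g
    return total / n_episodes

# Hypothetical one-step environment: action 1 pays reward 1, action 0 pays 0.
def sample_episode(policy):
    return [float(policy())]

random.seed(0)
eval_policy = lambda: 1 if random.random() < 0.8 else 0  # picks action 1 w.p. 0.8
est = monte_carlo_return(eval_policy, sample_episode, n_episodes=10000)
# est is close to 0.8, the evaluation policy's true expected return
```

The paper's setting asks how to collect the episodes fed to such an estimator when some offline data already exists.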
1 Citation
Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration
Decoupled RL (DeRL) is introduced as a general framework that trains separate policies for intrinsically-motivated exploration and exploitation; this decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency.

References
Importance Sampling Policy Evaluation with an Estimated Behavior Policy
This paper studies importance sampling with an estimated behavior policy, where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate, and finds that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or with a behavior policy estimated from a separate data set.
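The counterintuitive result above — that an estimated behavior policy can beat the true one — can be seen in a one-step bandit simplification. This sketch is illustrative only (the paper treats full MDPs); all names and the toy environment are hypothetical.

```python
import random
from collections import Counter

# Ordinary importance sampling: mean of pi_e(a)/pi_b(a) * r over logged data.
def is_estimate(data, pi_e, pi_b):
    return sum(pi_e(a) / pi_b(a) * r for a, r in data) / len(data)

random.seed(1)
# Logged data from a uniform behavior policy; action 1 pays reward 1.
data = [(a, float(a)) for a in (random.randint(0, 1) for _ in range(5000))]

pi_e = lambda a: 0.9 if a == 1 else 0.1     # evaluation policy (true value 0.9)
pi_b_true = lambda a: 0.5                   # true behavior probabilities
counts = Counter(a for a, _ in data)
pi_b_mle = lambda a: counts[a] / len(data)  # MLE from the same logged data

est_true = is_estimate(data, pi_e, pi_b_true)  # fluctuates with sampling error
est_mle = is_estimate(data, pi_e, pi_b_mle)    # recovers 0.9 almost exactly:
# the MLE weights cancel sampling error in the empirical action frequencies
```

In this toy case the MLE-weighted estimate is exact regardless of how the coin flips landed, while the true-probability estimate inherits the sampling error of the logged actions.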
Data-Efficient Policy Evaluation Through Behavior Policy Search
A novel policy evaluation sub-problem is proposed, behavior policy search: searching for a behavior policy that reduces mean squared error. It is shown that data collected by deploying a different policy can be used to produce unbiased estimates with lower mean squared error.
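The effect behind behavior policy search can be demonstrated in a bandit simplification: sampling from a behavior policy that oversamples rare, high-reward actions and reweighting with importance sampling stays unbiased but cuts variance. This is an illustrative sketch, not the paper's algorithm; all names are hypothetical.

```python
import random

# Ordinary importance sampling estimate from logged (action, reward) pairs.
def is_estimate(data, pi_e, pi_b):
    return sum(pi_e(a) / pi_b(a) * r for a, r in data) / len(data)

# Empirical variance of the estimator when data is drawn from pi_b.
def estimator_variance(pi_b, pi_e, n_trials=200, n=100, seed=0):
    rng = random.Random(seed)
    ests = []
    for _ in range(n_trials):
        data = []
        for _ in range(n):
            a = 1 if rng.random() < pi_b(1) else 0
            data.append((a, float(a) * 10.0))  # action 1 pays reward 10
        ests.append(is_estimate(data, pi_e, pi_b))
    m = sum(ests) / len(ests)
    return sum((e - m) ** 2 for e in ests) / len(ests)

pi_e = lambda a: 0.1 if a == 1 else 0.9  # rarely picks the paying action
var_onpolicy = estimator_variance(pi_e, pi_e)  # sample from pi_e itself
pi_b = lambda a: 0.5                      # oversamples the paying action
var_search = estimator_variance(pi_b, pi_e)
# var_search comes out well below var_onpolicy: both estimators are unbiased,
# but the searched behavior policy sees the rare reward far more often
```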
Toward Minimax Off-policy Value Estimation
It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
Eligibility Traces for Off-Policy Policy Evaluation
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
A new method is proposed for predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way to mix between model-based and importance-sampling-based estimates.
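The doubly robust estimator mentioned above combines a (possibly biased) reward model with an importance-weighted correction, so that either an accurate model or accurate behavior probabilities suffices for a good estimate. Below is a contextual-bandit simplification (the paper extends this to sequential settings); the environment and names are hypothetical.

```python
import random

# Doubly robust OPE (bandit form): model-based term plus
# importance-weighted residual of the model on the logged action.
def doubly_robust(data, pi_e, pi_b, q_hat, actions=(0, 1)):
    total = 0.0
    for a, r in data:
        model = sum(pi_e(b) * q_hat(b) for b in actions)
        correction = pi_e(a) / pi_b(a) * (r - q_hat(a))
        total += model + correction
    return total / len(data)

random.seed(2)
# Logged data from a uniform behavior policy; action 1 pays reward 1.
data = [(a, float(a)) for a in (random.randint(0, 1) for _ in range(5000))]
pi_e = lambda a: 0.9 if a == 1 else 0.1  # evaluation policy (true value 0.9)
pi_b = lambda a: 0.5                     # behavior policy
q_hat = lambda a: 0.5                    # deliberately biased reward model

est_dr = doubly_robust(data, pi_e, pi_b, q_hat)
# est_dr lands near 0.9 despite the biased model:
# the importance-sampling correction removes the model's bias
```

Using the model alone here would give 0.5; the correction term restores unbiasedness while the model keeps variance down when it is accurate.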
Importance sampling in reinforcement learning with an estimated behavior policy
This article studies importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimates under the observed data, and shows that this general technique reduces variance due to sampling error in Monte Carlo-style estimators.
Batch Policy Learning under Constraints
A new and simple method for off-policy policy evaluation (OPE) with PAC-style bounds is proposed, achieving strong empirical results in different domains, including the challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving.
Reducing Sampling Error in Batch Temporal Difference Learning
The concept of a certainty-equivalence estimate is refined, and it is argued that PSEC-TD(0) is a more data-efficient estimator than TD(0), a canonical TD algorithm, for a fixed batch of data.
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
This work develops a novel class of off-policy batch RL algorithms able to learn effectively offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior and KL-control to penalize divergence from this prior during RL training.
OFFER: Off-Environment Reinforcement Learning
It is proved that OFFER converges to a locally optimal policy, and it is shown experimentally that it learns better and faster than a policy gradient baseline.