Corpus ID: 245424947

Model Selection in Batch Policy Optimization

Authors: Jonathan N. Lee, G. Tucker, Ofir Nachum, Bo Dai
We study the problem of model selection in batch policy optimization: given a fixed, partial-feedback dataset and M model classes, learn a policy whose performance is competitive with the policy derived from the best model class. We formalize the problem in the contextual bandit setting with linear model classes by identifying three sources of error that any model selection algorithm must optimally trade off in order to be competitive: (1) approximation error, (2) statistical complexity… 
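The trade-off described in the abstract can be illustrated with a toy penalized selection rule. This is a generic sketch, not the paper's algorithm: it uses nested linear classes, simplifies the partial-feedback bandit setting to full-feedback reward regression, and all constants and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged dataset (illustrative only), simplified from the bandit
# setting to full-feedback reward regression: contexts X, rewards R.
n, D = 500, 8
X = rng.normal(size=(n, D))
theta_true = rng.normal(size=D)
R = X @ theta_true + 0.1 * rng.normal(size=n)

# Nested linear model classes: class d uses only the first d features.
dims = [2, 4, 8]

def empirical_mse(d):
    """Least-squares fit using the first d features; return training MSE."""
    Xd = X[:, :d]
    theta, *_ = np.linalg.lstsq(Xd, R, rcond=None)
    return float(np.mean((Xd @ theta - R) ** 2))

# Score each class by empirical error (approximation error) plus a
# sqrt(d/n)-style penalty (statistical complexity); pick the minimizer.
scores = [empirical_mse(d) + np.sqrt(d / n) for d in dims]
best_dim = dims[int(np.argmin(scores))]
print(best_dim)
```

Here the smaller classes incur large approximation error while the full class pays only a modest complexity penalty, so the rule selects the largest class; the paper's point is that balancing these error sources optimally from partial-feedback data is the hard part.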
2 Citations


Bellman Residual Orthogonalization for Offline Reinforcement Learning
A new reinforcement learning principle is introduced that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions, and an oracle inequality is proved for the policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy.
Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
It is suggested that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community, and that exploratory data allows vanilla off-policy RL algorithms to outperform or match state-of-the-art offline RL algorithms on downstream tasks.


References

On the Optimality of Batch Policy Optimization Algorithms
This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, enabling a general analysis, and introduces a new weighted-minimax criterion that accounts for the inherent difficulty of optimal value prediction.
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
An easily computable confidence bound for the policy evaluator is provided, which may be useful for optimistic planning and safe policy improvement, and a finite-sample instance-dependent error upper bound and a nearly matching minimax lower bound are established.
Online Model Selection for Reinforcement Learning with Function Approximation
A meta-algorithm is presented that successively rejects increasingly complex models using a simple statistical test and automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates.
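The successive-rejection idea can be shown with a minimal sketch, assuming a hypothetical misspecification test: the function passed below is a stand-in for the paper's statistical test, and the class names are made up.

```python
# Minimal sketch of a successive-rejection meta-algorithm: starting from
# the simplest candidate class, advance to the next one only while a
# (hypothetical) statistical test flags the current class as misspecified.
def select_model(classes, misspecified):
    for m in classes:
        if not misspecified(m):
            return m          # first class the test does not reject
    return classes[-1]        # fall back to the most complex class

# Toy usage: three classes of increasing complexity; the stand-in test
# rejects the two simplest ones.
classes = ["d=2", "d=4", "d=8"]
chosen = select_model(classes, lambda m: m in {"d=2", "d=4"})
print(chosen)
```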
Hyperparameter Selection for Offline Reinforcement Learning
This work focuses on offline hyperparameter selection, i.e., methods for choosing the best policy from a set of many policies trained with different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.
Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage
It is demonstrated that the proposed algorithmic framework can be applied to many specialized Markov decision processes, where the additional structural assumptions can further refine the concept of partial coverage.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
Batch Value-function Approximation with Only Realizability
The algorithm, BVFT, breaks the hardness conjecture via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions.
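The tournament reduction can be sketched generically. The comparison oracle below is a placeholder (BVFT's actual oracle compares two value functions via a state-action partition constructed from them), and this sequential-champion scheme is only one simple way to run a tournament.

```python
# Generic tournament: reduce selecting among M candidates to pairwise
# comparisons. `better(a, b)` is a stand-in comparison oracle.
def tournament(candidates, better):
    champion = candidates[0]
    for challenger in candidates[1:]:
        if better(challenger, champion):
            champion = challenger
    return champion

# Toy usage: pick the candidate value function with the smallest loss.
losses = {"f1": 0.9, "f2": 0.3, "f3": 0.5}
winner = tournament(list(losses), lambda a, b: losses[a] < losses[b])
print(winner)
```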
AlgaeDICE: Policy Gradient from Arbitrary Experience
A new formulation of max-return optimization is proposed that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic, off-policy data distribution; it is shown that, if the auxiliary dual variables of the objective are optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.
Offline Policy Selection under Uncertainty
It is shown how the belief distribution estimated by BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric, and it is empirically demonstrated that this selection procedure significantly outperforms existing approaches, such as ranking policies by mean or high-confidence lower-bound value estimates.
Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning
This paper designs hyperparameter-free algorithms for policy selection based on BVFT, a recent theoretical advance in value-function selection, and demonstrates their effectiveness in discrete-action benchmarks such as Atari.