# Model Selection in Batch Policy Optimization

@article{Lee2021ModelSI, title={Model Selection in Batch Policy Optimization}, author={Jonathan N. Lee and G. Tucker and Ofir Nachum and Bo Dai}, journal={ArXiv}, year={2021}, volume={abs/2112.12320} }

We study the problem of model selection in batch policy optimization: given a fixed, partial-feedback dataset and M model classes, learn a policy with performance that is competitive with the policy derived from the best model class. We formalize the problem in the contextual bandit setting with linear model classes by identifying three sources of error that any model selection algorithm should optimally trade off in order to be competitive: (1) approximation error, (2) statistical complexity…

## 2 Citations

Bellman Residual Orthogonalization for Offline Reinforcement Learning

- Mathematics, Computer Science
- ArXiv
- 2022

A new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions is introduced, and an oracle inequality is proved on the authors' policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy.

Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2022

It is suggested that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community, and that exploratory data allows vanilla off-policy RL algorithms to outperform or match state-of-the-art offline RL algorithms on downstream tasks.

## References

SHOWING 1-10 OF 49 REFERENCES

On the Optimality of Batch Policy Optimization Algorithms

- Computer Science
- ICML
- 2021

This work introduces a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis and introduces a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

- Computer Science
- ICML
- 2020

This work provides an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement, and establishes a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound.

Online Model Selection for Reinforcement Learning with Function Approximation

- Computer Science
- AISTATS
- 2021

A meta-algorithm is presented that successively rejects increasingly complex models using a simple statistical test and automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates.

Hyperparameter Selection for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2020

This work focuses on offline hyperparameter selection, i.e., methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data, and shows that offline RL algorithms are not robust to hyperparameter choices.

Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage

- Computer Science
- ArXiv
- 2021

It is demonstrated that this algorithmic framework can be applied to many specialized Markov Decision Processes where the additional structural assumptions can further refine the concept of partial coverage.

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

- Computer Science
- NeurIPS
- 2019

This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.

AlgaeDICE: Policy Gradient from Arbitrary Experience

- Computer Science
- ArXiv
- 2019

A new formulation of max-return optimization is presented that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution, and it is shown that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.

Offline Policy Selection under Uncertainty

- Computer Science
- AISTATS
- 2022

It is shown how the belief distribution estimated by BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric, and it is empirically demonstrated that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.

Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

- Computer Science
- NeurIPS
- 2021

This paper designs hyperparameter-free algorithms for policy selection based on BVFT, a recent theoretical advance in value-function selection, and demonstrates their effectiveness in discrete-action benchmarks such as Atari.

Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles

- Computer Science
- AISTATS
- 2020

This paper considers a setting where an ensemble of pre-trained and possibly inaccurate simulators (models) is available, and approximates the real environment using a state-dependent linear combination of the ensemble, where the coefficients are determined by the given state features and some unknown parameters.