• Corpus ID: 233033732

On the Optimality of Batch Policy Optimization Algorithms

Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li, Csaba Szepesvári, Dale Schuurmans
International Conference on Machine Learning

Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain underdeveloped. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted…

Model Selection in Batch Policy Optimization

It is shown that no batch policy optimization algorithm can achieve a guarantee addressing all three sources of error simultaneously, revealing a stark contrast between difficulties in batch policy optimization and the positive results available in supervised learning.

Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization

This paper proposes a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward and shows that the method provably generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works.

On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging

Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL

Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) is attractive in the applications involving a large population of homogeneous agents, as it exploits the permutation invariance of agents and

Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDP) and provides a unified framework towards


In this paper, we study the non-asymptotic and asymptotic performances of the optimal robust policy and value function of robust Markov Decision Processes (MDPs), where the optimal robust policy and

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

This work analyzes the Adaptive Pessimistic Value Iteration (APVI) algorithm and derives a suboptimality upper bound that nearly matches O(H ∑ …).

Non-asymptotic Performances of Robust Markov Decision Processes

This paper considers three different uncertainty sets, namely the L1, χ², and KL balls, under both (s, a)-rectangular and s-rectangular assumptions, to characterize the non-asymptotic performance of the optimal policy on the robust value function with the true transition dynamics.

Characterizing Uniform Convergence in Offline Policy Evaluation via model-based approach: Offline Learning, Task-Agnostic and Reward-Free

An Ω(H²S/d_mε²) lower bound (over the model-based family) for global uniform OPE, where d_m is the minimal state-action distribution induced by the behavior policy, is established; it implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case.

ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data

In theory, it is proved that the learned policy of ARMOR never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data.



Is Pessimism Provably Efficient for Offline RL?

This work proposes a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function, and establishes a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs).

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

This paper proposes Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL, and establishes an information-theoretic lower bound of Ω(H²/d_mε²), which certifies that OPDVR is optimal up to logarithmic factors.

The Importance of Pessimism in Fixed-Dataset Policy Optimization

It is shown why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and families of algorithms that follow this principle are derived.

A General Framework for Optimal Data-Driven Optimization

A general framework is developed which allows the characterization of decision formulations which are optimal in a precise sense and shows that under certain mild technical assumptions closely related to the existence of a sufficient statistic satisfying a large deviation principle, the optimal decision enjoys an intuitive separation into an estimation and a subsequent robust optimization step.

Provably Good Batch Reinforcement Learning Without Great Exploration

It is shown that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees on the performance of the output policy, and in certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.

MOPO: Model-based Offline Policy Optimization

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks.

A Convex Optimization Approach to Distributionally Robust Markov Decision Processes With Wasserstein Distance

  • Insoon Yang
  • Computer Science
  • IEEE Control Systems Letters
  • 2017
The existence and optimality of Markov policies are proved, convex optimization-based tools to compute and analyze the policies are developed, and a sensitivity analysis tool is developed to quantify the effect of ambiguity set parameters on the performance of distributionally robust policies.

From Data to Decisions: Distributionally Robust Optimization is Optimal

This paper proposes a meta-optimization problem to find the least conservative predictors and prescriptors subject to constraints on their out-of-sample disappointment and proves that the best predictor-prescriptor-pair is obtained by solving a distributionally robust optimization problem over all distributions within a given relative entropy distance from the empirical distribution of the data.

Distributionally robust optimization for sequential decision-making

This paper generalizes existing works on distributionally robust MDPs with generalized-moment-based and statistical-distance-based ambiguity sets to incorporate information from the former class such as moments and dispersions to the latter class that critically depends on empirical observations of the uncertain parameters.

CoinDICE: Off-Policy Confidence Interval Estimation

This work proposes CoinDICE, a novel and efficient algorithm for computing confidence intervals in high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, and proves the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes.