Corpus ID: 246240262

IMO3: Interactive Multi-Objective Off-Policy Optimization

Nan Wang, Hongning Wang, Maryam Karimzadehgan, Branislav Kveton, Craig Boutilier
Most real-world optimization problems have multiple objectives. A system designer needs to find a policy that trades off these objectives to reach a desired operating point. This problem has been studied extensively in the setting of known objective functions. We consider a more practical but challenging setting of unknown objective functions. In industry, this problem is mostly approached with online A/B testing, which is often costly and inefficient. As an alternative, we propose interactive… 

A Flexible Framework for Multi-Objective Bayesian Optimization using Random Scalarizations
This work proposes a strategy based on random scalarizations of the objectives that can flexibly sample from desired regions of the Pareto front and is computationally considerably cheaper than most approaches for MOO.
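As a rough illustration of the random-scalarization idea (a minimal sketch, not the paper's implementation; the function name is illustrative), weights can be drawn uniformly from the probability simplex and used to collapse an objective vector into a scalar:

```python
import numpy as np

def random_linear_scalarization(objectives, rng):
    """Collapse a vector of objective values into a scalar using weights
    drawn uniformly from the probability simplex; repeated calls sample
    different trade-offs, hence different regions of the Pareto front."""
    k = len(objectives)
    weights = rng.dirichlet(np.ones(k))  # Dirichlet(1,...,1) is uniform on the simplex
    return float(np.dot(weights, objectives))

rng = np.random.default_rng(0)
a = random_linear_scalarization([0.9, 0.1], rng)  # some value in [0.1, 0.9]
b = random_linear_scalarization([0.5, 0.5], rng)  # always 0.5, for any weights
```

Optimizing the scalarized value for many independently drawn weight vectors yields a spread of Pareto-optimal solutions rather than a single operating point.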
Designing multi-objective multi-armed bandits algorithms: A study
A variant of the scalarized multi-objective UCB1 that removes online inefficient scalarizations in order to improve the algorithm's efficiency is introduced and these algorithms are experimentally compared on multi-objective Bernoulli distributions, ParetoUCB1 being the algorithm with the best empirical performance.
Interactive Thompson Sampling for Multi-objective Multi-armed Bandits
This paper proposes two algorithms, Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and shows empirically that their regret closely approximates the regret of UCB and regular Thompson sampling provided with the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem
This paper extends the Thompson Sampling approach to the multi-objective multi-armed bandit problem and empirically compares Pareto Thompson Sampling and linear scalarized Thompson Sampling on a test suite of MOMAB problems with Bernoulli distributions.
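One round of linearly scalarized Thompson sampling for a multi-objective Bernoulli bandit can be sketched as follows (an assumed setup for illustration, not code from either paper): sample each arm's per-objective mean from its Beta posterior, scalarize the sampled vectors, and pull the arm with the best score.

```python
import numpy as np

def scalarized_thompson_step(successes, failures, weights, rng):
    """One round of linearly scalarized Thompson sampling.

    successes, failures: (n_arms, n_objectives) Beta posterior counts.
    weights: fixed scalarization weights over objectives (sum to 1).
    Returns the index of the arm to pull this round.
    """
    # Sample a mean-reward matrix from the per-arm, per-objective
    # Beta posteriors (with a Beta(1, 1) prior), then scalarize.
    theta = rng.beta(successes + 1, failures + 1)
    scores = theta @ weights
    return int(np.argmax(scores))

rng = np.random.default_rng(1)
successes = np.array([[50.0, 5.0], [20.0, 20.0]])
failures = np.array([[5.0, 50.0], [20.0, 20.0]])
arm = scalarized_thompson_step(successes, failures, np.array([0.5, 0.5]), rng)
```

The Pareto variant instead keeps the sampled vectors and pulls an arm whose sample is not Pareto-dominated by any other arm's sample.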
Doubly Robust Policy Evaluation and Learning
It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice.
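The doubly robust estimator combines a reward model (the direct method) with an importance-weighted correction, so it stays consistent if either component is accurate. A minimal sketch for a deterministic target policy (variable names are illustrative, not from the paper):

```python
import numpy as np

def doubly_robust_value(rewards, logged_actions, logging_probs,
                        target_actions, q_hat):
    """Doubly robust off-policy value estimate for a deterministic
    target policy.

    rewards: observed rewards under the logging policy.
    logged_actions: actions the logging policy took.
    logging_probs: probabilities the logging policy assigned to them.
    target_actions: actions the target policy would take per context.
    q_hat: (n, n_actions) model estimates of expected reward.
    """
    n = len(rewards)
    idx = np.arange(n)
    # Direct-method term: model value of the target policy's actions.
    dm = q_hat[idx, target_actions]
    # Importance-weighted correction of the model's residual error,
    # applied only where the logging and target actions agree.
    match = (logged_actions == target_actions).astype(float)
    correction = match / logging_probs * (rewards - q_hat[idx, logged_actions])
    return float(np.mean(dm + correction))

rewards = np.array([1.0, 0.0])
logged_actions = np.array([0, 1])
logging_probs = np.array([0.5, 0.5])
target_actions = np.array([0, 0])
q_hat = np.array([[0.6, 0.2], [0.4, 0.3]])
v = doubly_robust_value(rewards, logged_actions, logging_probs,
                        target_actions, q_hat)  # ≈ 0.9
```

Relative to plain inverse-propensity scoring, the model term absorbs most of the reward signal, which is what drives the variance reduction the abstract refers to.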
Solving real-world multi-objective engineering optimization problems with an Election-Based Hyper-Heuristic
Evaluating MOABHH on four real-world multi-objective engineering problems shows that the strategy always finds solutions at least equal to those generated by the best algorithm, and sometimes even better.
A POMDP formulation of preference elicitation problems
Methods that exploit the special structure of preference elicitation to deal with parameterized belief states over the continuous state space, and gradient techniques for optimizing parameterized actions are described.
Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization
This paper introduces a novel scalarization function and shows that drawing random scalarizations from an appropriately chosen distribution can be used to efficiently approximate the hypervolume indicator metric; it highlights the general utility of this framework by showing that any provably convergent single-objective optimization process can be effortlessly converted to a multi-objective optimization process with provable convergence guarantees.
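A sketch of the hypervolume scalarization this line of work uses (my reconstruction, assuming the form s_λ(y) = min_i max(0, y_i/λ_i)^k with directions λ drawn from the positive unit sphere):

```python
import numpy as np

def hypervolume_scalarization(y, lam):
    """Hypervolume scalarization s_lam(y) = min_i max(0, y_i / lam_i)^k.

    Averaging this over directions lam drawn uniformly from the positive
    unit sphere approximates the hypervolume indicator up to a constant.
    """
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    k = y.size  # number of objectives
    ratios = np.maximum(y / lam, 0.0)
    return float(np.min(ratios) ** k)

# With the all-ones direction, a point at (1, 1) scores exactly 1.
s0 = hypervolume_scalarization([1.0, 1.0], [1.0, 1.0])  # 1.0
rng = np.random.default_rng(2)
lam = np.abs(rng.normal(size=2))  # positive direction on the unit sphere
lam /= np.linalg.norm(lam)
s1 = hypervolume_scalarization([0.8, 0.6], lam)
```

Because each draw of λ reduces the problem to a single scalar objective, any single-objective optimizer can be run per draw, which is the conversion argument the summary describes.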
Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
This paper examines EVOI optimization using choice queries, queries in which a user is asked to select her most preferred product from a set, and shows that, under very general assumptions, the optimal choice query w.r.t.…
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
This work develops a learning principle and an efficient algorithm for batch learning from logged bandit feedback and shows how CRM can be used to derive a new learning method, Policy Optimizer for Exponential Models (POEM), for learning stochastic linear rules for structured output prediction.