Corpus ID: 53030121

Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

@article{Zhu2018LearningTR,
  title={Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure},
  author={Ruihao Zhu and Eytan H. Modiano},
  journal={ArXiv},
  year={2018},
  volume={abs/1810.10637}
}
We introduce efficient algorithms which achieve nearly optimal regrets for the problem of stochastic online shortest path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandits problem, a special case of the linear stochastic bandits problem. We show how the difficulties posed by the large-scale action set can be overcome by the networked structure of the action set. Our approach presents a novel connection between bandit learning and…
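The abstract's framing of routing as a linear bandit can be made concrete: each candidate path is a binary incidence vector over the edges, and end-to-end feedback reveals only the total (noisy) delay of the chosen path, not per-edge delays. A minimal sketch on a hypothetical toy network (the edge delays, paths, and noise scale below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 edges, 3 candidate source-destination paths,
# each path encoded as a binary incidence vector over the edges.
paths = np.array([
    [1, 1, 0, 0],   # path A uses edges 0 and 1
    [1, 0, 1, 0],   # path B uses edges 0 and 2
    [0, 0, 1, 1],   # path C uses edges 2 and 3
])
theta = np.array([0.2, 0.3, 0.1, 0.4])  # unknown mean edge delays

def end_to_end_delay(path):
    """End-to-end feedback: only the total path delay is observed,
    not the delay of each traversed edge."""
    return path @ theta + 0.05 * rng.standard_normal()

# Benchmark the learner competes against: the shortest path in expectation.
best = paths[np.argmin(paths @ theta)]
```

The learner only ever sees `end_to_end_delay(path)` samples; exploiting the shared-edge structure of `paths` is what the paper's algorithms leverage to beat generic linear-bandit rates.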
1 Citation
Meta Dynamic Pricing: Transfer Learning Across Experiments
TLDR
A meta dynamic pricing algorithm that learns an unknown prior online while solving a sequence of Thompson sampling pricing experiments for N different products; it significantly speeds up learning compared to prior-independent algorithms.

References

SHOWING 1-10 OF 43 REFERENCES
Online linear optimization and adaptive routing
TLDR
This paper presents two randomized online algorithms for selecting a sequence of routing paths in a network with unknown edge delays that vary adversarially over time; the regret of both algorithms is sublinear in the number of trials and polynomial in the size of the network.
Adaptive shortest-path routing under unknown and stochastically varying link states
  • K. Liu, Qing Zhao
  • Computer Science
  • 2012 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt)
  • 2012
TLDR
By exploiting arm dependencies, regret polynomial in the network size can be achieved while maintaining the optimal logarithmic order in time; the results find applications in cognitive radio and ad hoc networks with unknown and dynamic communication environments.
Stochastic Online Shortest Path Routing: The Value of Feedback
TLDR
Three algorithms, trading off computational complexity against performance, are proposed for online shortest path routing over multihop networks; their regret upper bounds improve over those of existing algorithms.
Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
TLDR
This paper analyzes a UCB-like algorithm for computationally and sample-efficient learning in stochastic combinatorial semi-bandits, and proves $O(K L (1/\Delta) \log n)$ and $O(\sqrt{K L n \log n})$ upper bounds on its $n$-step regret.
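For intuition, a minimal sketch of the kind of per-item optimistic index a CombUCB1-style semi-bandit algorithm maintains, phrased here for delay minimization (so a lower confidence bound); the constant 1.5 and the clipping to non-negative delays are illustrative choices, not the paper's exact tuning:

```python
import numpy as np

def optimistic_delays(counts, means, t, c=1.5):
    """Per-edge optimistic (lower-confidence) delay estimates in the spirit
    of a CombUCB1-style semi-bandit index; untried edges get estimate 0 so
    they look maximally attractive. c is an illustrative constant."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    bonus = np.sqrt(c * np.log(t) / np.maximum(counts, 1.0))
    idx = np.maximum(means - bonus, 0.0)
    return np.where(counts > 0, idx, 0.0)

# A semi-bandit learner would then route along the path minimizing the sum
# of these per-edge estimates (an incidence matrix times the index vector),
# and update each traversed edge's count and mean from its observed delay.
```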
Finite-time Analysis of the Multiarmed Bandit Problem
TLDR
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
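The UCB1 index from this reference is simple enough to state in a few lines; a sketch (the unpulled-arm handling and data layout are my own choices):

```python
import math

def ucb1_indices(pulls, totals, t):
    """UCB1 (Auer et al.): empirical mean reward plus an exploration bonus
    that shrinks as an arm is pulled more often. Unpulled arms get an
    infinite index so that every arm is tried at least once."""
    return [
        float("inf") if n == 0 else s / n + math.sqrt(2.0 * math.log(t) / n)
        for n, s in zip(pulls, totals)
    ]
```

Playing the argmax of these indices each round is what yields the uniform-in-time logarithmic regret for bounded rewards.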
Linear Thompson Sampling Revisited
TLDR
Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.
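A minimal sketch of one round of linear Thompson sampling in the sense described above: sample a parameter from a Gaussian centered at the regularized least-squares estimate, then act greedily on the sample. The variance scale `v` and the action-set layout are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def lin_ts_step(A, b, actions, v=1.0):
    """One round of linear Thompson sampling.

    A: d x d regularized Gram matrix of past actions; b: d-vector of
    reward-weighted past actions. Inflating the posterior covariance
    (larger v) is what buys a fixed probability of being optimistic,
    at the extra sqrt(d) regret cost noted above."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                  # ridge estimate of the parameter
    theta_tilde = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    return int(np.argmax(actions @ theta_tilde))  # greedy on the sample
```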
Stochastic Linear Optimization under Bandit Feedback
TLDR
A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
Regret in Online Combinatorial Optimization
TLDR
This work addresses online linear optimization problems where the possible actions of the decision maker are represented by binary vectors, and shows that the standard exponentially weighted average forecaster is a provably suboptimal strategy.
Learning to Optimize via Posterior Sampling
TLDR
A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; it depends on a new notion the authors call the eluder dimension, which measures the degree of dependence among action rewards.
Improved Algorithms for Linear Stochastic Bandits
TLDR
A simple modification of Auer's UCB algorithm achieves constant regret with high probability and improves the regret bound by a logarithmic factor; experiments show a vast improvement in practice.