Learn More
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times.
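For reference, the regret mentioned in this abstract is the gap between the reward of always playing the best arm and the reward actually collected. A standard formulation is sketched below; the symbols (μ*, μ_i, I_t) are our notation, not taken from the abstract.

```latex
% Expected regret after T plays of a K-armed bandit, where mu_i is the mean
% reward of arm i, mu^* the mean of the best arm, and I_t the arm chosen at
% round t (notation assumed for illustration).
R_T \;=\; T\mu^{*} \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big],
\qquad
\mu^{*} \;=\; \max_{1 \le i \le K} \mu_i .
```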
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff).
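As a concrete toy version of the model described above, one can simulate K slot machines with unknown Bernoulli payoff probabilities. The class below is an illustrative sketch only; the names and parameters are ours, not from the paper.

```python
import random

class BernoulliBandit:
    """Toy K-armed bandit: arm i pays 1 with (unknown) probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = list(probs)          # true success probabilities, hidden from the player
        self.best_mean = max(self.probs)  # used only to compute regret after the fact

    def pull(self, arm):
        """Play one arm and return a 0/1 reward."""
        return 1 if random.random() < self.probs[arm] else 0

# Example: three machines; the player does not know that arm 2 is best.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(1)
```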
  • Peter Auer, Journal of Machine Learning Research, 2002
We show how a standard tool from statistics — namely confidence bounds — can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process.
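A minimal sketch of the confidence-bound idea is the classic UCB1 index: play the arm whose empirical mean plus a confidence radius is largest. The code below is an illustrative implementation of that standard rule, not a reproduction of the algorithms analyzed in the paper.

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds on a K-armed bandit given by pull(arm) -> reward in [0, 1]."""
    counts = [0] * K      # times each arm was played
    sums = [0.0] * K      # total reward collected per arm

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1   # play every arm once to initialize the estimates
        else:
            # Index = empirical mean + exploration bonus (confidence radius).
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Example on a toy Bernoulli bandit (illustrative only):
probs = [0.2, 0.5, 0.7]
counts, sums = ucb1(lambda a: 1 if random.random() < probs[a] else 0, K=3, T=5000)
```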
This paper explores the power and the limitations of weakly supervised categorization. We present a complete framework that starts with the extraction of various local regions of either discontinuity or homogeneity. A variety of local descriptors can be applied to form a set of feature vectors for each local region. Boosting is used to learn a subset of these feature vectors (weak hypotheses) and to combine them into a final hypothesis for each visual category.
In this paper we describe the first stage of a new learning system for object detection and recognition. For our system we propose Boosting [5] as the underlying learning technique. This allows the use of very diverse sets of visual features in the learning process within a common framework: Boosting — together with a weak hypotheses finder — may choose very different features as most valuable for categorization.
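The role of boosting with a weak-hypotheses finder can be illustrated with a bare-bones AdaBoost loop over threshold stumps. This is a generic sketch of boosting, not the authors' system, feature set, or weak-hypotheses finder.

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    """Bare-bones AdaBoost with threshold stumps.  X: (n, d) features, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)   # example weights
    ensemble = []             # list of (alpha, feature, threshold, sign)

    for _ in range(rounds):
        # Weak-hypotheses finder: pick the stump with the smallest weighted error.
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] >= thr, sign, -sign)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = max(err, 1e-12)                      # avoid division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)    # weight of this weak hypothesis
        pred = np.where(X[:, j] >= thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)             # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of the learned stumps; returns labels in {-1, +1}."""
    score = sum(alpha * np.where(X[:, j] >= thr, sign, -sign)
                for alpha, j, thr, sign in ensemble)
    return np.sign(score)
```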
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average).
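In symbols, the diameter and the resulting regret guarantee from this line of work can be summarized as follows; constants and logarithmic factors are omitted, and the notation (S states, A actions, T steps) is ours.

```latex
% Diameter: worst-case expected travel time between any two states under the
% best policy for that pair (T(s' | pi, s) is the first hitting time of s').
D \;=\; \max_{s \neq s'} \;\min_{\pi}\; \mathbb{E}\big[\,T(s' \mid \pi, s)\,\big]
% Total regret after T steps for an MDP with S states and A actions:
\text{Regret}(T) \;=\; \tilde{O}\big(D\,S\sqrt{A\,T}\big)
```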
Most of the performance bounds for on-line learning algorithms are proven assuming a constant learning rate. To optimize these bounds, the learning rate must be tuned based on quantities that are generally unknown, as they depend on the whole sequence of examples. In this paper we show that essentially the same optimized bounds can be obtained when the algorithms adaptively tune their learning rates as the examples in the sequence are progressively revealed.
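As a concrete instance of the tuning issue, for the exponentially weighted forecaster over N experts with losses in [0, 1] the fixed learning rate is typically set from the horizon T. The textbook form is sketched below; this is standard background, not a quote from the paper, and the adaptive variant is described only at the level of the idea.

```latex
% With a fixed learning rate chosen from the horizon T and the number of experts N,
\eta \;=\; \sqrt{\tfrac{8 \ln N}{T}}
\quad\Longrightarrow\quad
L_T - \min_i L_{i,T} \;\le\; \sqrt{\tfrac{T}{2}\,\ln N}
% An adaptive ("self-confident") scheme instead uses a time-varying eta_t computed
% from quantities observed so far, so T need not be known in advance while the
% bound remains essentially unchanged.
```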
We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. In the authors’ previous work (Kalyanakrishnan & Stone, 2010), this problem is framed in a PAC setting.
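A naive baseline for this subset-selection problem is to sample every arm the same number of times and keep the m largest empirical means. The sketch below only illustrates the problem statement; it is not the authors' algorithm, and all names and parameters are ours.

```python
import random

def top_m_by_uniform_sampling(pull, n, m, samples_per_arm):
    """Sample each of the n arms equally often and return the m arms with the
    highest empirical mean reward.  `pull(arm)` returns a stochastic reward."""
    means = []
    for arm in range(n):
        total = sum(pull(arm) for _ in range(samples_per_arm))
        means.append(total / samples_per_arm)
    # Indices of the m largest empirical means.
    return sorted(range(n), key=lambda a: means[a], reverse=True)[:m]

# Example on a toy Bernoulli instance (illustrative only):
probs = [0.1, 0.4, 0.6, 0.3, 0.8]
chosen = top_m_by_uniform_sampling(
    lambda a: 1 if random.random() < probs[a] else 0, n=5, m=2, samples_per_arm=500)
```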
In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · K log(T)/Δ, where Δ measures the distance between a suboptimal arm and the optimal arm, for the modified algorithm we show a correspondingly improved bound.
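For orientation, the original UCB algorithm (UCB1) plays the arm maximizing an empirical mean plus a confidence radius, and the regret bound quoted above is the standard one for that index. The formulas below restate this in our notation; they do not reproduce the paper's modified algorithm.

```latex
% UCB1-style index after t total plays, for an arm with empirical mean \bar{x}_i
% that has been played n_i times: play the arm maximizing
\bar{x}_i \;+\; \sqrt{\frac{2 \ln t}{n_i}}
% The regret bound quoted in the abstract, with \Delta the gap between a
% suboptimal arm and the optimal arm:
R_T \;\le\; \mathrm{const} \cdot \frac{K \log T}{\Delta}
```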