Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one…
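As a rough illustration of the averaging scheme this abstract mentions, the minimal sketch below runs SGD on a one-dimensional, 1-strongly convex toy objective and returns both the last iterate and the average of all iterates. The toy objective, the helper name `averaged_sgd`, the 1/(λt) step size, and every parameter value are assumptions made for illustration; none of them is taken from the paper.

```python
import random


def averaged_sgd(sample, grad, w0, strong_convexity, num_steps):
    """Run SGD with step size 1/(lambda * t).

    Returns a pair: (last iterate, average of the iterates w_1..w_T).
    """
    w = w0
    running_sum = 0.0
    for t in range(1, num_steps + 1):
        running_sum += w                      # accumulate w_t before updating it
        z = sample()                          # draw one stochastic sample
        eta = 1.0 / (strong_convexity * t)    # assumed step-size schedule
        w = w - eta * grad(w, z)              # stochastic gradient step
    return w, running_sum / num_steps


# Toy problem (an assumption): f(w; z) = 0.5 * (w - z)^2 with z ~ N(mu, 1).
# The expected objective is 1-strongly convex and is minimized at w = mu.
rng = random.Random(0)
mu = 3.0
last, avg = averaged_sgd(
    sample=lambda: rng.gauss(mu, 1.0),
    grad=lambda w, z: w - z,
    w0=0.0,
    strong_convexity=1.0,
    num_steps=20000,
)
print(f"last iterate: {last:.3f}, averaged iterate: {avg:.3f}")
```

Both returned values should land close to `mu` on this toy problem; the sketch does not attempt to reproduce the convergence rates discussed in the abstract.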

For supervised classification problems, it is well known that learnability is equivalent to uniform convergence of the empirical risks and thus to learnability by empirical minimization. Inspired by recent regret bounds for online convex optimization, we study stochastic convex optimization, and uncover a surprisingly different situation in the more…

Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector…

We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where…

The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model. A central issue is learning these models in high dimensions, such as when there is some sparsity pattern of the optimal parameter. This work characterizes a certain strong convexity property of general exponential…

It is well known that neural networks are computationally hard to train. On the other hand, in practice, modern-day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g., ReLU), over-specification (i.e., training networks which are larger than needed), and regularization. In this paper we…

- Ohad Shamir
- ICML
- 2016

We study the convergence properties of the VR-PCA algorithm introduced by [19] for fast computation of leading singular vectors. We prove several new results, including a formal analysis of a block version of the algorithm, and convergence from random initialization. We also make a few observations of independent interest, such as how pre-initializing with…

The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is…

We address the problem of minimizing a convex function over the space of large matrices with low rank. While this optimization problem is hard in general, we propose an efficient greedy algorithm and derive its formal approximation guarantees. Each iteration of the algorithm involves (approximately) finding the left and right singular vectors corresponding…

We present a novel Newton-type method for distributed optimization, which is particularly well suited for stochastic optimization and learning problems. For quadratic objectives, the method enjoys a linear rate of convergence which provably improves with the data size, requiring an essentially constant number of iterations under reasonable assumptions. We…