This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models (including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation) that exploits a certain tensor structure in their low-order observable moments (typically second- and third-order). Specifically,…

This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak…

Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. The essential…

This manuscript shows that AdaBoost and its immediate variants can produce approximate maximum margin classifiers simply by scaling step size choices with a fixed small constant. In this way, when the unscaled step size is an optimal choice, these results provide guarantees for Friedman's empirically successful "shrinkage" procedure for gradient boosting…

For any positive integer k, there exist neural networks with $\Theta(k^3)$ layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which cannot be approximated by networks with $O(k)$ layers unless they are exponentially large: they must possess $\Omega(2^k)$ nodes. This result is proved here for a class of nodes termed semi-algebraic gates which includes the common…

Hartigan's method for k-means clustering is the following greedy heuristic: select a point, and optimally reassign it. This paper develops two other formulations of the heuristic, one leading to a number of consistency properties, the other showing that the data partition is always quite separated from the induced Voronoi partition. A characterization of…
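
The greedy step described above ("select a point, and optimally reassign it") can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the function names are hypothetical, a point that is alone in its cluster is never moved, and the cost change uses the standard closed form for removing a point from one cluster and adding it to another.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            c = centroid(members)
            total += sum(sq_dist(p, c) for p in members)
    return total

def hartigan(points, labels, k, max_passes=100):
    """Repeatedly move single points to whichever cluster lowers the cost most."""
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(points):
            a = labels[i]
            members_a = [p for p, l in zip(points, labels) if l == a]
            n_a = len(members_a)
            if n_a == 1:
                continue  # assumption: never empty a cluster
            # Cost saved by removing x from its current cluster.
            gain = n_a / (n_a - 1) * sq_dist(x, centroid(members_a))
            best_j, best_delta = a, 0.0
            for j in range(k):
                if j == a:
                    continue
                members_j = [p for p, l in zip(points, labels) if l == j]
                n_j = len(members_j)
                # Cost added by inserting x into cluster j (zero if j is empty).
                loss = n_j / (n_j + 1) * sq_dist(x, centroid(members_j)) if n_j else 0.0
                if loss - gain < best_delta - 1e-12:
                    best_j, best_delta = j, loss - gain
            if best_j != a:
                labels[i] = best_j
                moved = True
        if not moved:
            break
    return labels

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
start = [0, 1, 0, 1, 0, 1]
final = hartigan(points, start, k=2)
print(final, kmeans_cost(points, final, 2))
```

Each accepted move strictly lowers the k-means cost, so the loop terminates; on this toy input the deliberately poor alternating start is repaired into the two obvious groups.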

This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is…
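
One way to build intuition for why depth helps here is to compose a small piecewise-linear "tent" block with itself. The sketch below is an illustration only: the choice of ReLU units, the tent map, and all function names are assumptions, and it does not attempt to reproduce the exact architecture or node counts stated in the abstract. It shows that k compositions of a two-ReLU block oscillate 2^k times across the level 1/2 on [0, 1], the kind of rapidly alternating pattern that is hard to match without either depth or many nodes.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def tent(x):
    """Tent map on [0, 1] built from ReLUs: 2x for x <= 1/2, 2(1 - x) otherwise."""
    return relu(2.0 * relu(x) - 4.0 * relu(x - 0.5))

def deep_tent(x, k):
    """Apply the tent block k times, as a deep (or recurrent) network would."""
    for _ in range(k):
        x = tent(x)
    return x

def crossings(f, grid, level=0.5):
    """Count how often f switches sides of `level` between consecutive grid points."""
    above = [f(x) > level for x in grid]
    return sum(1 for s, t in zip(above, above[1:]) if s != t)

k = 4
grid = [i / 2 ** k for i in range(2 ** k + 1)]
values = [deep_tent(x, k) for x in grid]
print(values)
print(crossings(lambda x: deep_tent(x, k), grid))
```

On the dyadic grid the composed function takes the values 0 and 1 alternately, so the number of crossings doubles with every additional application of the block.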

This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was $O(\exp(1/\epsilon^2))$. First, it is established that the setting of weak learnability aids the entire class, granting a rate $O(\ln(1/\epsilon))$. Next, the (disjoint) conditions under which…

This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse (most coordinates are small). Let Dir(α) denote a Dirichlet distribution with all parameters equal to α. $\Pr[\,\ldots \le 6 c_0 \ln(n)\,] \ge 1 - 1/n^{c_0}$. The parameter is taken to be 1/n, which is standard in machine learning.…
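
The sparsity claim is easy to check empirically. The sketch below draws once from Dir(1/n) by normalizing independent Gamma variables and counts the coordinates that are not small. Two things here are assumptions rather than statements from the abstract: the quantity compared against 6·c0·ln(n) is taken to be the number of coordinates of size at least 1/n^c0, and the helper names are hypothetical.

```python
import math
import random

def dirichlet_draw(n, alpha, rng):
    """Draw from Dir(alpha) in n dimensions by normalizing Gamma(alpha, 1) samples."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]

def count_large(draw, threshold):
    """Number of coordinates that are at least `threshold`."""
    return sum(1 for x in draw if x >= threshold)

n, c0 = 1000, 1.0
rng = random.Random(0)                 # fixed seed so the run is repeatable
draw = dirichlet_draw(n, 1.0 / n, rng)

threshold = n ** -c0                   # assumed notion of "small": below 1/n^c0
large = count_large(draw, threshold)
bound = 6 * c0 * math.log(n)           # the 6 c0 ln(n) term from the abstract

print(large, "of", n, "coordinates are at least", threshold)
print("comparison value 6*c0*ln(n):", round(bound, 1))
```

With n = 1000 the comparison value is about 41, while only a handful of coordinates typically carry almost all of the mass.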