Corpus ID: 3345089

Learning Mixture of Gaussians with Streaming Data

Aditi Raghunathan, Prateek Jain, Ravishankar Krishnaswamy
In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each…
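A single-pass, Lloyd-style update of the kind the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the constant step size `eta`, the hand-picked initial centers, and the helper names `nearest`, `streaming_lloyd`, and `sample_mixture` are all assumptions made for this example.

```python
import random

def nearest(centers, x):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(centers[i], x)))

def streaming_lloyd(stream, init_centers, eta=0.05):
    """One pass over `stream`: each point pulls its nearest center toward it.

    `eta` is a constant step size (an assumption of this sketch).
    """
    centers = [list(c) for c in init_centers]
    for x in stream:
        i = nearest(centers, x)
        centers[i] = [(1 - eta) * c + eta * v for c, v in zip(centers[i], x)]
    return centers

def sample_mixture(n, means, sigma, rng):
    """Yield n points from a uniform mixture of spherical Gaussians."""
    for _ in range(n):
        mu = rng.choice(means)
        yield [rng.gauss(m, sigma) for m in mu]

rng = random.Random(0)
true_means = [[0.0, 0.0], [10.0, 10.0]]   # well-separated components
init = [[1.0, -1.0], [8.0, 9.0]]          # rough initial guesses (assumed given)
estimated = streaming_lloyd(sample_mixture(5000, true_means, 1.0, rng),
                            init, eta=0.01)
```

Each point is seen once and then discarded, so memory use is proportional to $k \cdot d$ rather than to $N$. In this toy setup the components are far apart relative to their spread, which is the regime in which the abstract says the centers can be estimated accurately.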
Coresets for Gaussian Mixture Models of Any Shape
The main technique is a reduction between coresets for $k$-GMMs and projective clustering problems, and it is hoped that these coresets, which are generic, with no special dependency on GMMs, will be useful for many other functions.
Mixture of GANs for Clustering
The experiments show that the proposed GANMM can have good performance on complex data as well as simple data and disables the commonly employed expectation-maximization procedure.
No-substitution k-means Clustering with Adversarial Order
A new complexity measure is introduced that quantifies the difficulty of clustering a dataset arriving in arbitrary order; a new randomized algorithm is designed and proven, when applied to data with complexity $d$, to take $O(d \log(n) k \log(k))$ centers and to be an $O(k)$-approximation.
Fast-BoW: Scaling Bag-of-Visual-Words Generation
This paper replaces the process of finding the closest cluster center with a softmax classifier, which improves the cluster boundaries over k-means and can be used for both hard and soft BoW encoding, and quantizes the real weights into integer weights that can be represented using only a few bits.
A scalable method Fast-BoW is presented for reducing the computation time of bag-of-visual-words (BoW) feature generation for both hard and soft vector-quantization with time complexities, and Genetic-SVM which makes use of the distributed genetic algorithm to reduce the time taken in solving the SVM objective function.
Variable size sampling to support high uniformity confidence in sensor data streams
The proposed UC-KSample retains the advantage of KSample, dynamic sampling over a fixed sampling ratio, while improving the uniformity confidence.


Ten Steps of EM Suffice for Mixtures of Two Gaussians
This work shows that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provides simple, closed-form expressions for the convergence rate.
Sample-Efficient Learning of Mixtures
This work provides a method for PAC learning of probability distributions with sample complexity O, and shows that the class of mixtures of $k$ axis-aligned Gaussians in $\mathbb{R}^d$ is PAC-learnable in the agnostic setting with $\widetilde{O}(kd/\epsilon^4)$ samples, which is tight in $k$ and $d$ up to logarithmic factors.
Convergence Rate of Stochastic k-means
It is shown, for the first time, that starting with any initial solution, online and mini-batch variants of the widely used k-means algorithm converge to a "local optimum" at rate $O(\frac{1}{t})$ (in terms of the $k$-means objective) under general conditions.
Clustering with Spectral Norm and the k-Means Algorithm
  • Amit Kumar, R. Kannan
  • Mathematics, Computer Science
  • 2010 IEEE 51st Annual Symposium on Foundations of Computer Science
  • 2010
This paper shows that a simple clustering algorithm works without assuming any generative (probabilistic) model, and proves some new results for generative models, e.g., it can cluster all but a small fraction of points assuming only a bound on the variance.
A spectral algorithm for learning mixture models
We show that a simple spectral algorithm for learning a mixture of $k$ spherical Gaussians in $\mathbb{R}^n$ works remarkably well: it succeeds in identifying the Gaussians assuming essentially the minimum…
Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm
This work shows that simply picking a random initial point and applying the update rule suffices to accurately estimate the top eigenvector, with a suitable choice of $\eta_i$, and sheds light on how to efficiently perform streaming PCA both in theory and in practice.
Spectral clustering with limited independence
This paper considers the well-studied problem of clustering a set of objects under a probabilistic model of data in which each object is represented as a vector over the set of features, and there…
Memory Limited, Streaming PCA
An algorithm is presented that uses $O(kp)$ memory and is able to compute the $k$-dimensional spike with $O(p \log p)$ sample complexity, the first algorithm of its kind.
Statistical guarantees for the EM algorithm: From population to sample-based analysis
A general framework is developed for proving rigorous guarantees on the performance of the EM algorithm and a variant known as gradient EM, along with consequences of the general theory for three canonical examples of incomplete-data problems.
On Lloyd's Algorithm: New Theoretical Insights for Clustering in Practice
Any $O(k)$-approximation seeding followed by Lloyd's update works, and Lloyd's algorithm converges linearly before reaching a plateau.