We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++ …
We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute …
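The abstracts above mention running k-means++ on a small weighted sample. As a rough illustration only, the following is a minimal Python sketch of k-means++-style seeding (D² sampling) adapted to weighted points. It is not the StreamKM++ implementation: the sample construction itself is not described in the truncated text, and the function name `weighted_kmeanspp_seeds`, its parameters, and the choice to multiply each squared distance by the point weight are assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeanspp_seeds(points, weights, k, rng=None):
    """Pick up to k seed centers by weighted D^2 sampling (hypothetical helper).

    Each new center is drawn with probability proportional to
    weight * (squared distance to the nearest center chosen so far).
    """
    rng = rng or random.Random()
    # First center: drawn proportionally to the point weights.
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:
            # Every point coincides with an existing center; stop early.
            break
        c = rng.choices(points, weights=scores, k=1)[0]
        centers.append(c)
        # Update nearest-center distances with the new center.
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers
```

Points that already coincide with a chosen center get probability zero, so the seeds tend to spread across well-separated groups. The seeds would then typically be refined by a few iterations of a standard k-means heuristic on the weighted sample.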
In this paper, we present a randomized constant-factor approximation algorithm for the metric minimum facility location problem with uniform costs and demands in a distributed setting, in which every point can open a facility. In particular, our distributed algorithm uses three communication rounds with message sizes bounded by O(log n) bits …
This document contains a list of open problems and research directions that have been suggested by participants at the Bertinoro Workshop on Sublinear Algorithms (May 2011) and IITK Workshop on Algorithms for Processing Massive Data Sets (December 2009). Many of the questions were discussed at the workshop or were posed during presentations. Further details …
The k-means algorithm is one of the most widely used clustering heuristics. Despite its simplicity, analyzing its running time and quality of approximation is surprisingly difficult and can lead to deep insights that can be used to improve the algorithm. In this paper we survey the recent results in this direction as well as several extensions of the basic …
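For readers unfamiliar with the heuristic this survey refers to, the following is a minimal Python sketch of the basic k-means iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its assigned points. The function name `lloyd`, the `max_iter` parameter, and the convention of leaving an empty cluster's center unchanged are assumptions for this example, not details taken from the surveyed paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, max_iter=100):
    """Run the basic k-means iteration from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments are stable, so the heuristic has converged
        centers = new_centers
    return centers
```

The result depends heavily on the initial centers, which is one reason careful seeding matters in practice.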
The focus of our work is introducing and constructing probabilistic coresets. A probabilistic coreset can contain probabilistic points, and the number of these points should be polylogarithmic in the input size. However, the overall storage size is also influenced by the representation size of the probability distribution of each point. So, our first …
We present a deterministic kinetic data structure for the facility location problem that maintains a subset of the moving points as facilities such that, at any point in time, the accumulated cost for the whole point set is at most a constant factor larger than the optimal cost. In our scenario, each point can change its status between client and facility …
We study the problem of computing low-distortion embeddings in the streaming model. We present streaming algorithms that, given an n-point metric space M, compute an embedding of M into an n-point metric space M′ that preserves a (1−σ)-fraction of the distances with small distortion (σ is called the slack). Our algorithms use space polylogarithmic in n and …