We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++… (More)

We develop a new <i>k</i>-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the <i>k</i>-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute… (More)
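The final step described in the abstract above, running k-means++ on a small weighted sample, starts from the seeding rule of Arthur and Vassilvitskii, which can be sketched as follows. This is a minimal illustration of that seeding on weighted points, not the StreamKM++ sample construction itself (the abstract is truncated before describing it). The names `sq_dist` and `kmeanspp_seed` and the tuple representation of points are assumptions made for this example only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, weights, k, rng=None):
    """Pick up to k initial centers from weighted points by D^2 sampling.

    Hypothetical helper for illustration; each new center is drawn with
    probability proportional to weight * squared distance to the nearest
    center chosen so far.
    """
    rng = rng or random.Random()
    # First center: drawn with probability proportional to its weight.
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:
            break  # every remaining point coincides with a chosen center
        c = rng.choices(points, weights=scores, k=1)[0]
        centers.append(c)
        # Update nearest-center distances with the newly added center.
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers
```

The returned centers would then serve as the starting point for Lloyd-style iterations on the weighted sample. Points far from every chosen center are more likely to be picked next, which is what spreads the initial centers across the data.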

In this paper, we present a randomized constant factor approximation algorithm for the metric minimum facility location problem with uniform costs and demands in a distributed setting, in which every point can open a facility. In particular, our distributed algorithm uses three communication rounds with message sizes bounded to <i>O</i>(log <i>n</i>) bits… (More)

- Piotr Indyk, Andrew McGregor, Ilan Newman, Nir Ailon, Noga Alon, Alexandr Andoni +47 others
- 2011

This document contains a list of open problems and research directions that have been suggested by participants at the Bertinoro Workshop on Sublinear Algorithms (May 2011) and IITK Workshop on Algorithms for Processing Massive Data Sets (December 2009). Many of the questions were discussed at the workshop or were posed during presentations. Further details… (More)

The focus of our work is introducing and constructing probabilistic coresets. A probabilistic coreset can contain probabilistic points, and the number of these points should be polylogarithmic in the input size. However, the overall storage size is also influenced by the representation size of the probability distribution of each point. So, our first… (More)

The k-means algorithm is one of the most widely used clustering heuristics. Despite its simplicity, analyzing its running time and quality of approximation is surprisingly difficult and can lead to deep insights that can be used to improve the algorithm. In this paper, we survey the recent results in this direction as well as several extensions of the basic… (More)