• Corpus ID: 43973573

A Practical Algorithm for Distributed Clustering and Outlier Detection

Jiecao Chen, Erfan Sadeqi Azer, Qin Zhang
We study the classic $k$-means and $k$-median clustering problems, which are fundamental in unsupervised learning, in the setting where data are partitioned across multiple sites and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time and communication efficient, has good approximation guarantees, and can identify the global outliers effectively…
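
To make the objective concrete, the following is a minimal sketch of the clustering-with-outliers cost the abstract refers to: each point is charged its distance to the nearest center, and the `z` most expensive points are discarded before summing. The function and parameter names (`cost_with_outliers`, `z`, `squared`) are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_with_outliers(points, centers, z, squared=True):
    """Clustering cost after discarding the z points farthest from their nearest center.

    squared=True gives a k-means-style cost (sum of squared distances),
    squared=False gives a k-median-style cost (sum of distances).
    """
    per_point = []
    for p in points:
        d2 = min(sq_dist(p, c) for c in centers)
        per_point.append(d2 if squared else d2 ** 0.5)
    per_point.sort()
    kept = per_point[: max(len(per_point) - z, 0)]  # drop the z largest costs
    return sum(kept)


# Hypothetical toy data: two tight groups plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (100.0, 100.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]

cost_all = cost_with_outliers(points, centers, z=0)    # far point dominates the cost
cost_clean = cost_with_outliers(points, centers, z=1)  # far point discarded as an outlier
print(cost_all, cost_clean)
```

On this toy input, allowing a single outlier removes almost all of the cost, which is why the choice of `z` matters so much for the quality of the resulting centers.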

k-Clustering with Fair Outliers

This work studies the problem of k-clustering with fair outlier removal, provides the first approximation algorithm for well-known clustering formulations such as k-means and k-median, and proves that the algorithm has strong theoretical guarantees.

Distributed k-Means with Outliers in General Metrics

A distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model, that obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space.

The Effectiveness of Uniform Sampling for Center-Based Clustering with Outliers

A "significance" criterion is introduced, and it is proved that the performance of uniform sampling depends on the significance degree of the given instance. The paper also proposes the first uniform sampling approach that discards exactly $z$ outliers for these three center-based clustering with outliers problems.

Is Simple Uniform Sampling Efficient for Center-Based Clustering With Outliers: When and Why?

This paper proposes a simple uniform sampling framework for solving three representative center-based clustering with outliers problems: k-center/median/means clustering with outliers. It introduces a measure of “significance” and proves that the performance of the framework depends on the significance degree of the given instance.

Distributed k-Clustering for Data with Heavy Noise

The number of outliers is improved to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$.

Fast Noise Removal for k-Means Clustering

A simple greedy algorithm is developed that has provably strong worst-case guarantees and gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. The result is a scalable, near-linear-time algorithm.

Adapting k-means algorithms for outliers

This paper shows how to adapt several sequential and distributed $k$-means algorithms to the setting with outliers, with substantially stronger theoretical guarantees: their algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.

Parallel and Efficient Hierarchical k-Median Clustering

This paper introduces a new parallel algorithm for the Euclidean hierarchical $k$-median problem that outputs a hierarchical clustering such that, for every value of $k$, the cost of the solution is at most an $O(\min\{d, \log n\} \log \Delta)$ factor larger in expectation than that of an optimal solution.

Layered Sampling for Robust Optimization Problems

This paper proposes a new variant of the coreset technique, layered sampling, to deal with two fundamental robust optimization problems: $k$-median/means clustering with outliers and linear regression with outliers.



k-means--: A Unified Approach to Clustering and Outlier Detection

The approach is formalized as a generalization of the k-means problem. The problem is proved to be NP-hard, and a practical polynomial-time algorithm is presented that is guaranteed to converge to a local optimum.

Local Search Methods for k-Means with Outliers

This work proposes a simple local search-based algorithm for k-means clustering with outliers and proves that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets.

Data reduction for weighted and outlier-resistant clustering

The essential challenge that arises in these optimization problems is data reduction for the weighted k-median problem. This work solves that problem, which was previously solved only in one dimension ([Har-Peled FSTTCS'06], [Feldman, Fiat and Sharir FOCS'06]).

Distributed k-means and k-median clustering on general communication topologies

A distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies is provided.

Optimal Time Bounds for Approximate Clustering

Using successive sampling, an algorithm is developed for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal.

Distributed Partial Clustering

This paper develops the first algorithms for the partial k-median and means objectives that run in subquadratic running time and initiates the study of distributed algorithms for clustering uncertain data, where each data point can possibly fall into multiple locations under certain probability distribution.

Fast clustering using MapReduce

This paper designs clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets, and focuses on the practical and popular clustering problems, k-center and k-median.

Communication-Optimal Distributed Clustering

This work highlights the surprising power of a broadcast channel for clustering problems. Roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n·s, while in the broadcast model it is n + s.

Improved Distributed Principal Component Analysis

New algorithms and analyses for distributed PCA are given which lead to improved communication and computational costs for k-means clustering and related problems, and a speedup of orders of magnitude is shown on real world data.