# A Practical Algorithm for Distributed Clustering and Outlier Detection

@article{Chen2018APA, title={A Practical Algorithm for Distributed Clustering and Outlier Detection}, author={Jiecao Chen and Erfan Sadeqi Azer and Qin Zhang}, journal={ArXiv}, year={2018}, volume={abs/1805.09495} }

We study the classic $k$-means and $k$-median clustering problems, which are fundamental in unsupervised learning, in the setting where data are partitioned across multiple sites and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time- and communication-efficient, has good approximation guarantees, and can identify the global outliers effectively…
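To make the objective in the abstract concrete, the following is a minimal sketch of how the cost of a candidate solution could be evaluated when $z$ points may be discarded as outliers. It is not the paper's summary-construction algorithm; the function name `clustering_cost_with_outliers` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def clustering_cost_with_outliers(points, centers, z, power=2):
    """Evaluate a set of centers when the z farthest points are discarded.

    Hypothetical helper for illustration only (not from the paper).
    power=2 gives the k-means cost, power=1 gives the k-median cost.
    Returns (cost over the kept points, indices of the discarded points).
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Distance from every point to every center, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

    # Each point is charged the distance to its nearest center.
    nearest = dists.min(axis=1)

    # Sort points by that distance; the last z are labeled as outliers.
    order = np.argsort(nearest)
    n = len(points)
    kept, outliers = order[: n - z], order[n - z:]

    cost = float(np.sum(nearest[kept] ** power))
    return cost, outliers
```

In the distributed setting described above, each site holds only part of `points`, so a routine like this could be run on a small summary gathered at a coordinator rather than on the full dataset.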

## 18 Citations

### k-Clustering with Fair Outliers

- Computer Science
- WSDM
- 2022

This work studies the problem of k-clustering with fair outlier removal and provides the first approximation algorithm for well-known clustering formulations such as k-means and k-median; the analysis proves that the algorithm has strong theoretical guarantees.

### Distributed k-Means with Outliers in General Metrics

- Computer Science, Mathematics
- ArXiv
- 2022

A distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model, that obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space.

### The Effectiveness of Uniform Sampling for Center-Based Clustering with Outliers

- Computer Science
- 2019

A "significance" criterion is introduced, and it is proved that the performance of uniform sampling depends on the significance degree of the given instance; the first uniform sampling approach that allows discarding exactly $z$ outliers for these three center-based clustering with outliers problems is also proposed.

### Is Simple Uniform Sampling Efficient for Center-Based Clustering With Outliers: When and Why?

- Computer Science
- ArXiv
- 2021

This paper proposes a simple uniform sampling framework for solving three representative center-based clustering with outliers problems (k-center/median/means clustering with outliers), introduces a measure of “significance”, and proves that the performance of the framework depends on the significance degree of the given instance.

### Distributed k-Clustering for Data with Heavy Noise

- Computer Science
- NeurIPS
- 2018

The number of outliers is improved to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$.

### SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

- Computer Science
- Knowl. Based Syst.
- 2021

### Fast Noise Removal for k-Means Clustering

- Computer Science
- AISTATS
- 2020

A simple greedy algorithm, scalable and running in near-linear time, is developed; it has provably strong worst-case guarantees and gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers.

### Adapting k-means algorithms for outliers

- Computer Science
- ICML
- 2022

This paper shows how to adapt several sequential and distributed $k$-means algorithms to the setting with outliers, with substantially stronger theoretical guarantees: the algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.

### Parallel and Efficient Hierarchical k-Median Clustering

- Computer Science
- NeurIPS
- 2021

This paper introduces a new parallel algorithm for the Euclidean hierarchical $k$-median problem that outputs a hierarchical clustering such that, for every value of $k$, the cost of the solution is at most an $O(\min\{d, \log n\} \log \Delta)$ factor larger in expectation than that of an optimal solution.

### Layered Sampling for Robust Optimization Problems

- Computer Science
- ICML
- 2020

This paper proposes a new variant of the coreset technique, layered sampling, to deal with two fundamental robust optimization problems: $k$-median/means clustering with outliers and linear regression with outliers.

## References

### k-means-: A Unified Approach to Clustering and Outlier Detection

- Computer Science
- SDM
- 2013

It is proved that the problem is NP-hard and then a practical polynomial time algorithm is presented, which is guaranteed to converge to a local optimum, and the approach is formalized as a generalization of the k-means problem.

### Local Search Methods for k-Means with Outliers

- Computer Science
- Proc. VLDB Endow.
- 2017

This work proposes a simple local search-based algorithm for k-means clustering with outliers and proves that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets.

### Distributed k-Clustering for Data with Heavy Noise

- Computer Science
- NeurIPS
- 2018

The number of outliers is improved to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$.

### Data reduction for weighted and outlier-resistant clustering

- Computer Science
- SODA
- 2012

The essential challenge that arises in these optimization problems is data reduction for the weighted k-median problem, and this work solves this problem, which was previously solved only in one dimension ([Har-Peled FSTTCS'06], [Feldman, Fiat and Sharir FOCS'06]).

### Distributed k-means and k-median clustering on general communication topologies

- Computer Science
- NIPS
- 2013

A distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies is provided.

### Optimal Time Bounds for Approximate Clustering

- Computer Science
- Machine Learning
- 2004

Using successive sampling, an algorithm is developed for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal.

### Distributed Partial Clustering

- Computer Science
- SPAA
- 2017

This paper develops the first algorithms for the partial k-median and k-means objectives that run in subquadratic time and initiates the study of distributed algorithms for clustering uncertain data, where each data point can fall into multiple locations under a certain probability distribution.

### Fast clustering using MapReduce

- Computer Science
- KDD
- 2011

This paper designs clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets, and focuses on the practical and popular clustering problems, k-center and k-median.

### Communication-Optimal Distributed Clustering

- Computer Science
- NIPS
- 2016

This work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster $n$ points or $n$ vertices in a graph distributed across $s$ servers, for a worst-case partitioning the communication complexity in a point-to-point model is $n \cdot s$, while in the broadcast model it is $n + s$.

### Improved Distributed Principal Component Analysis

- Computer Science
- NIPS
- 2014

New algorithms and analyses for distributed PCA are given which lead to improved communication and computational costs for k-means clustering and related problems, and a speedup of orders of magnitude is shown on real world data.