• Corpus ID: 43973573

# A Practical Algorithm for Distributed Clustering and Outlier Detection

@article{Chen2018APA,
title={A Practical Algorithm for Distributed Clustering and Outlier Detection},
author={Jiecao Chen and Erfan Sadeqi Azer and Qin Zhang},
journal={ArXiv},
year={2018},
volume={abs/1805.09495}
}
• Published 24 May 2018
• Computer Science
• ArXiv
We study the classic $k$-means/median clustering problems, which are fundamental in unsupervised learning, in the setting where data are partitioned across multiple sites and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time and communication efficient, has good approximation guarantees, and can identify the global outliers effectively…

### k-Clustering with Fair Outliers

• Computer Science
WSDM
• 2022
This work studies the problem of k-clustering with fair outlier removal, provides the first approximation algorithm for well-known clustering formulations such as k-means and k-median, and proves that this algorithm has strong theoretical guarantees.

### Distributed k-Means with Outliers in General Metrics

• Computer Science, Mathematics
ArXiv
• 2022
A distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model, that obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space.

### The Effectiveness of Uniform Sampling for Center-Based Clustering with Outliers

• Computer Science
• 2019
A "significance" criterion is introduced, and it is proved that the performance of uniform sampling depends on the significance degree of the given instance. The first uniform sampling approach that allows discarding exactly $z$ outliers for these three center-based clustering with outliers problems is also proposed.

### Is Simple Uniform Sampling Efficient for Center-Based Clustering With Outliers: When and Why?

• Computer Science
ArXiv
• 2021
This paper proposes a simple uniform sampling framework for solving three representative center-based clustering with outliers problems: k-center/median/means clustering with outliers, and introduces a measure of “significance” and proves that the performance of the framework depends on the significance degree of the given instance.

### Distributed k-Clustering for Data with Heavy Noise

• Computer Science
NeurIPS
• 2018
The number of outliers is improved to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$.

### Fast Noise Removal for k-Means Clustering

• Computer Science
AISTATS
• 2020

### Data reduction for weighted and outlier-resistant clustering

• Computer Science
SODA
• 2012
The essential challenge that arises in these optimization problems is data reduction for the weighted k-median problem, and this work solves this problem, which was previously solved only in one dimension ([Har-Peled FSTTCS'06], [Feldman, Fiat and Sharir FOCS'06]).

### Distributed k-means and k-median clustering on general communication topologies

• Computer Science
NIPS
• 2013
A distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies is provided.

### Optimal Time Bounds for Approximate Clustering

• Computer Science
Machine Learning
• 2004
Using successive sampling, an algorithm is developed for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal.

### Distributed Partial Clustering

• Computer Science
SPAA
• 2017
This paper develops the first algorithms for the partial k-median and k-means objectives that run in subquadratic time and initiates the study of distributed algorithms for clustering uncertain data, where each data point can possibly fall into multiple locations under a certain probability distribution.

### Fast clustering using MapReduce

• Computer Science
KDD
• 2011
This paper designs clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets, and focuses on the practical and popular clustering problems, k-center and k-median.

### Communication-Optimal Distributed Clustering

• Computer Science
NIPS
• 2016
This work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n · s, while in the broadcast model it is n + s.

### Improved Distributed Principal Component Analysis

• Computer Science
NIPS
• 2014
New algorithms and analyses for distributed PCA are given which lead to improved communication and computational costs for k-means clustering and related problems, and a speedup of orders of magnitude is shown on real world data.