# Accelerating exact k-means algorithms with geometric reasoning

@inproceedings{Pelleg1999AcceleratingEK, title={Accelerating exact k-means algorithms with geometric reasoning}, author={Dan Pelleg and Andrew W. Moore}, booktitle={KDD '99}, year={1999} }

Abstract: We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. An analysis of the geometry of the current cluster centers then greatly reduces the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm. Proofs of correctness are included…
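The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `dominates`, `accumulate`, and `kmeans_step` are hypothetical, and the pruning rule (assign a whole bounding box to one center when that center is strictly closer than every other center at every point of the box) is a simplified reading of the geometric reasoning described. Each kd-tree node caches a point count and a vector sum as its sufficient statistics, so a pruned node updates a center without visiting its points.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class Node:
    """kd-tree node caching sufficient statistics (count, vector sum) and a bounding box."""
    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.count = len(points)
        self.vsum = [sum(p[d] for p in points) for d in range(dim)]
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points, self.children = points, []
        if len(points) > leaf_size:
            axis = depth % dim  # cycle through the axes (an assumed split rule)
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.children = [Node(pts[:mid], leaf_size, depth + 1),
                             Node(pts[mid:], leaf_size, depth + 1)]

def box_sqdist(c, node):
    """Squared distance from c to the node's bounding box (0 if c is inside)."""
    return sum(max(lo - x, 0.0, x - hi) ** 2
               for x, lo, hi in zip(c, node.lo, node.hi))

def dominates(c1, c2, node):
    """True if every point of the box is strictly closer to c1 than to c2.

    d(c1,x)^2 - d(c2,x)^2 is linear in x, so it suffices to check the box
    corner that lies farthest in the direction c2 - c1.
    """
    corner = [node.hi[d] if c2[d] > c1[d] else node.lo[d] for d in range(len(c1))]
    return sqdist(c1, corner) < sqdist(c2, corner)

def accumulate(node, centers, sums, counts):
    k = len(centers)
    best = min(range(k), key=lambda j: box_sqdist(centers[j], node))
    if all(j == best or dominates(centers[best], centers[j], node) for j in range(k)):
        # One center owns the whole box: use the cached statistics.
        counts[best] += node.count
        for d, v in enumerate(node.vsum):
            sums[best][d] += v
    elif node.children:
        for child in node.children:
            accumulate(child, centers, sums, counts)
    else:
        # Leaf with no single owner: fall back to per-point assignment.
        for p in node.points:
            j = min(range(k), key=lambda j: sqdist(p, centers[j]))
            counts[j] += 1
            for d, v in enumerate(p):
                sums[j][d] += v

def kmeans_step(root, centers):
    """One exact k-means center update using the tree; empty clusters keep their center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    accumulate(root, centers, sums, counts)
    return [[s / n for s in sm] if n else list(c)
            for sm, n, c in zip(sums, counts, centers)]
```

Because a node is pruned only when one center strictly wins everywhere in its box, the update should match a brute-force pass over all points up to floating-point summation order, which is the sense in which such a scheme stays exact.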

## 392 Citations

Accelerating Lloyd’s Algorithm for k-Means Clustering

- Computer Science
- 2015

This chapter surveys some of the optimizations to speed up Lloyd’s algorithm and presents new algorithms which avoid distance calculations by the triangle inequality, which can run many times faster and compute far fewer distances than the standard unoptimized implementation.
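The triangle-inequality idea mentioned in this summary can be sketched as follows. This is only the basic lemma (if d(c, c') ≥ 2·d(x, c), then d(x, c') ≥ d(x, c), so c' cannot beat c), not the specific algorithms of the chapter; the function name `assign_with_pruning` and its return values are assumptions made for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping provably useless distances.

    Returns (labels, number of point-to-center distances actually computed).
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        computed += 1
        for j in range(1, k):
            # Triangle inequality: d(x, c_j) >= d(c_best, c_j) - d(x, c_best).
            # If d(c_best, c_j) >= 2 * d(x, c_best), c_j cannot be strictly closer.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = dist(x, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed
```

The assignments are the same as a full scan would produce; the saving shows up in the `computed` counter, which stays well below `len(points) * len(centers)` when clusters are well separated.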

2 The Standard k-Means Algorithm does a Lot of Unnecessary Work

- Computer Science
- 2017

This chapter surveys some of the optimizations to speed up Lloyd’s algorithm and presents new algorithms which avoid distance calculations by the triangle inequality, which can run many times faster and compute far fewer distances than the standard unoptimized implementation.

Geometric methods to accelerate k-means algorithms

- Computer Science, SDM
- 2016

These methods give tighter lower bound updates, efficiently skip centroids that cannot possibly be close to a set of points, keep extra information about upper bounds to help the heap algorithm avoid more distance computations, and decrease the number of distance calculations that are done in the first iteration.

Chapter 2 Accelerating Lloyd’s Algorithm for k-Means Clustering

- Computer Science
- 2017

This chapter surveys some of the optimizations to speed up Lloyd’s algorithm and presents new algorithms which avoid distance calculations by the triangle inequality, which can run many times faster and compute far fewer distances than the standard unoptimized implementation.

A local search approximation algorithm for k-means clustering

- Computer Science, SCG '02
- 2002

This work considers the question of whether there exists a simple and practical approximation algorithm for k-means clustering, and presents a local improvement heuristic based on swapping centers in and out that yields a (9+ε)-approximation algorithm.

A Dual-Tree Algorithm for Fast k-means Clustering With Large k

- Computer Science, SDM
- 2017

A dual-tree algorithm that gives exactly the same results as standard k-means when using cover trees; the single-iteration runtime of the algorithm is bounded as O(N + k log k), under some assumptions, giving the first sub-O(kN) bounds for exact Lloyd iterations.

Fast and exact out-of-core k-means clustering

- Computer Science, Fourth IEEE International Conference on Data Mining (ICDM'04)
- 2004

This paper presents a new algorithm which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm.

An Efficient k-Means Clustering Algorithm: Analysis and Implementation

- Computer Science, IEEE Trans. Pattern Anal. Mach. Intell.
- 2002

This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.

Space Partitioning for Scalable K-Means

- Computer Science, 2010 Ninth International Conference on Machine Learning and Applications
- 2010

The proposed space partitioning approach has been shown to overcome the well-known limitation of kd-trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which kd-trees have been used.

Scalable K-Means by ranked retrieval

- Computer Science, WSDM
- 2014

This paper shows how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques, and proposes a variant of the WAND algorithm that uses the intermediate results of nearest neighbor computations to improve performance.

## References

Refining Initial Points for K-Means Clustering

- Computer Science, ICML
- 1998

A procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution that allows the iterative algorithm to converge to a “better” local minimum.

A Database Interface for Clustering in Large Spatial Databases

- Computer Science, KDD
- 1995

This paper presents an interface to the database management system (DBMS) based on a spatial access method, the R*-tree, which is crucial for the efficiency of KDD on large databases and proposes a method for spatial data sampling as part of the focusing component, significantly reducing the number of objects to be clustered.

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

- Computer Science, J. Artif. Intell. Res.
- 1998

A very sparse data structure, the ADtree, is provided to minimize memory use and it is empirically demonstrated that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.

BIRCH: an efficient data clustering method for very large databases

- Computer Science, SIGMOD '96
- 1996

A data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is presented, and it is demonstrated that it is especially suitable for very large databases.

Multidimensional divide-and-conquer

- Computer Science, CACM
- 1980

Multidimensional divide-and-conquer is discussed, an algorithmic paradigm that can be instantiated in many different ways to yield a number of algorithms and data structures for multidimensional problems.

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

- Computer Science
- 1997

This paper introduces new algorithms and data structures for quick counting for machine learning datasets, and empirically demonstrates that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.

Efficient and Effective Clustering Methods for Spatial Data Mining

- Computer Science
- 1994

A new clustering method called CLARANS, based on randomized search, is developed and shown to be the most effective in spatial data mining; experiments comparing the performance of existing clustering methods show that CLARANS is the most efficient.

Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees

- Computer Science, NIPS
- 1998

A new algorithm is presented, based on the multiresolution kd-trees of [5], which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints.

Efficient and Effective Clustering Methods for Spatial Data Mining

- Computer Science, VLDB
- 1994

The analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.

Multiresolution Instance-Based Learning

- Computer Science, IJCAI
- 1995

A new way of structuring a database and a new algorithm for accessing it are presented and evaluated; the approach maintains the advantages of instance-based learning and permits the same flexibility as a conventional linear search, but at greatly reduced computational cost.