Accelerating exact k-means algorithms with geometric reasoning

@inproceedings{Pelleg1999AcceleratingEK,
  title={Accelerating exact k-means algorithms with geometric reasoning},
  author={Dan Pelleg and Andrew W. Moore},
  booktitle={KDD '99},
  year={1999}
}
Abstract: We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. Then an analysis of the geometry of the current cluster centers results in great reduction of the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm. Proofs of correctness are included…
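The baseline the paper accelerates is the classic Lloyd iteration. The sketch below (plain Python, illustrative only, not the authors' kd-tree implementation) shows one iteration, accumulating the per-cluster point count and coordinate sum; these are exactly the sufficient statistics the paper caches in kd-tree nodes so the update step never revisits individual points:

```python
def lloyd_iteration(points, centers):
    """One Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points.
    The per-cluster (count, coordinate-sum) pairs accumulated here are
    the 'sufficient statistics' the paper stores in kd-tree nodes."""
    k = len(centers)
    dim = len(points[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    assign = []
    for p in points:
        # Nearest-center query: the dominant cost the kd-tree prunes.
        j = min(range(k),
                key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                  for d in range(dim)))
        assign.append(j)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # Update step uses only (count, sum), never the raw points again.
    new_centers = [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centers[c])
        for c in range(k)
    ]
    return new_centers, assign
```

Iterating this to a fixed point gives standard k-means; the paper's contribution is pruning most of the nearest-center queries while returning identical results.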

Citations

Accelerating Lloyd’s Algorithm for k-Means Clustering
TLDR
This chapter surveys some of the optimizations to speed up Lloyd’s algorithm and presents new algorithms which avoid distance calculations by the triangle inequality, which can run many times faster and compute far fewer distances than the standard unoptimized implementation.
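The triangle-inequality pruning this survey describes rests on a standard lemma: if d(best, c) ≥ 2·d(x, best), then c cannot be closer to x than best, so d(x, c) need not be computed. A hedged sketch of that test (illustrative, not any one paper's exact algorithm; in practice the center-to-center distances are precomputed once per iteration):

```python
def nearest_center_pruned(x, centers, dist):
    """Find the center nearest to x, skipping distance computations
    via the triangle inequality: d(x, c) >= d(best, c) - d(x, best),
    so if d(best, c) >= 2 * d(x, best), c cannot beat best."""
    best = centers[0]
    best_d = dist(x, best)
    skipped = 0
    for c in centers[1:]:
        if dist(best, c) >= 2 * best_d:
            skipped += 1  # c provably no closer; d(x, c) never evaluated
            continue
        d = dist(x, c)
        if d < best_d:
            best, best_d = c, d
    return best, best_d, skipped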
Geometric methods to accelerate k-means algorithms
TLDR
These methods give tighter lower bound updates, efficiently skip centroids that cannot possibly be close to a set of points, keep extra information about upper bounds to help the heap algorithm avoid more distance computations, and decrease the number of distance calculations that are done in the first iteration.
A local search approximation algorithm for k-means clustering
TLDR
This work considers the question of whether there exists a simple and practical approximation algorithm for k-means clustering, and presents a local improvement heuristic based on swapping centers in and out that yields a (9+ε)-approximation algorithm.
A Dual-Tree Algorithm for Fast k-means Clustering With Large k
TLDR
A dual-tree algorithm that gives exactly the same results as standard k-means when using cover trees, with a single-iteration runtime bounded as O(N + k log k) under some assumptions; these are the first sub-O(kN) bounds for exact Lloyd iterations.
Fast and exact out-of-core k-means clustering
TLDR
This paper presents a new algorithm which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm.
An Efficient k-Means Clustering Algorithm: Analysis and Implementation
TLDR
This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.
Space Partitioning for Scalable K-Means
TLDR
The proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-trees have been used.
Scalable K-Means by ranked retrieval
TLDR
This paper shows how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques, and proposes a variant of the WAND algorithm that uses the intermediate results of nearest-neighbor computations to improve performance.

References

Showing 1-10 of 33 references
Refining Initial Points for K-Means Clustering
TLDR
A procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution that allows the iterative algorithm to converge to a “better” local minimum.
A Database Interface for Clustering in Large Spatial Databases
TLDR
This paper presents an interface to the database management system (DBMS) based on a spatial access method, the R*-tree, which is crucial for the efficiency of KDD on large databases and proposes a method for spatial data sampling as part of the focusing component, significantly reducing the number of objects to be clustered.
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
TLDR
A very sparse data structure, the ADtree, is provided to minimize memory use and it is empirically demonstrated that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.
BIRCH: an efficient data clustering method for very large databases
TLDR
A data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is presented, and it is demonstrated that it is especially suitable for very large databases.
Multidimensional divide-and-conquer
TLDR
Multidimensional divide-and-conquer is discussed, an algorithmic paradigm that can be instantiated in many different ways to yield a number of algorithms and data structures for multidimensional problems.
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
TLDR
This paper introduces new algorithms and data structures for quick counting for machine learning datasets, and empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.
Efficient and Effective Clustering Methods for Spatial Data Mining
TLDR
A new clustering method called CLARANS, based on randomized search, is developed; experiments comparing it with existing clustering methods show that CLARANS is the most efficient and effective for spatial data mining.
Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees
TLDR
A new algorithm is presented, based on the multiresolution kd-trees of [5], which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints.
Efficient and Effective Clustering Methods for Spatial Data Mining
TLDR
The analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.
Multiresolution Instance-Based Learning
TLDR
A new way of structuring a database and a new algorithm for accessing it are presented and evaluated; the approach maintains the advantages of instance-based learning and permits the same flexibility as a conventional linear search, but at greatly reduced computational cost.