Efficient clustering of high-dimensional data sets with application to reference matching

Abstract

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods for efficiently clustering datasets that are large in all three ways at once, for example, datasets with millions of data points that exist in many thousands of dimensions and represent many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Clustering is then performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and reduces error by 25% compared with a previously used algorithm.
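
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out how canopies are built, so this sketch assumes a two-threshold scheme: a loose threshold decides canopy membership, and a tight threshold decides which points may no longer start a new canopy. The names `build_canopies`, `candidate_pairs`, `cheap_dist`, `t_loose` and `t_tight` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random
from itertools import combinations


def build_canopies(points, cheap_dist, t_loose, t_tight, seed=0):
    """Group point indices into (possibly overlapping) canopies.

    Assumed scheme: t_loose >= t_tight, both measured with cheap_dist.
    """
    if t_loose < t_tight:
        raise ValueError("t_loose must be at least t_tight")
    rng = random.Random(seed)
    remaining = set(range(len(points)))  # points that may still seed a canopy
    canopies = []
    while remaining:
        center = rng.choice(sorted(remaining))
        # Cheap distance from the chosen center to every point.
        dists = [cheap_dist(points[center], p) for p in points]
        # Every point within the loose threshold joins this canopy,
        # even if it already belongs to another one (hence the overlap).
        canopies.append([i for i, d in enumerate(dists) if d <= t_loose])
        # Points within the tight threshold can no longer seed a canopy.
        remaining -= {i for i, d in enumerate(dists) if d <= t_tight}
        remaining.discard(center)  # guarantees progress
    return canopies


def candidate_pairs(canopies):
    """Index pairs sharing at least one canopy: the only pairs that
    would need an exact (expensive) distance computation."""
    pairs = set()
    for members in canopies:
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Toy 1-D data; absolute difference stands in for the cheap measure.
    pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    canopies = build_canopies(pts, lambda a, b: abs(a - b), 1.0, 0.5)
    print(sorted(canopies), len(candidate_pairs(canopies)))
```

On this toy input, only 4 of the 15 possible pairs share a canopy, so an exact distance would be computed for 4 pairs instead of 15. Any clustering method that consumes pairwise distances could then be restricted to the pairs returned by `candidate_pairs`.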

DOI: 10.1145/347090.347123


Semantic Scholar estimates that this publication has 923 citations based on the available data.

See our FAQ for additional information.

Cite this paper

@inproceedings{McCallum2000EfficientCO,
  title={Efficient clustering of high-dimensional data sets with application to reference matching},
  author={Andrew McCallum and Kamal Nigam and Lyle H. Ungar},
  booktitle={KDD},
  year={2000}
}