BIRCH: An Efficient Data Clustering Method for Very Large Databases

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of <i>clusters</i>, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named <i>BIRCH</i> (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. <i>BIRCH</i> incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). <i>BIRCH</i> can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. <i>BIRCH</i> is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate <i>BIRCH</i>'s time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of <i>BIRCH</i> versus <i>CLARANS</i>, a clustering method proposed recently for large datasets, and show that <i>BIRCH</i> is consistently superior.
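
The single-scan, incremental behaviour described in the abstract can be illustrated with a much-simplified sketch. It is not the method from the paper: it keeps a flat list of cluster summaries rather than a balanced hierarchy, and it does no noise handling or refinement scans. The names `Summary`, `single_scan_cluster`, and the `threshold` parameter are hypothetical and chosen only for this illustration. Each summary stores a count, a per-dimension sum, and a sum of squared norms, which is enough to update a cluster's centroid and radius without revisiting earlier points.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Summary:
    """Additive summary of one cluster (hypothetical helper for this sketch)."""
    n: int                    # number of points absorbed so far
    linear_sum: List[float]   # per-dimension sum of the points
    square_sum: float         # sum of squared norms of the points

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius_if_added(self, point: Sequence[float]) -> float:
        """Root mean squared distance to the centroid if `point` were added."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.linear_sum, point)]
        ss = self.square_sum + sum(x * x for x in point)
        # Mean squared distance to the centroid equals SS/n - ||LS/n||^2.
        variance = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off

    def add(self, point: Sequence[float]) -> None:
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)


def single_scan_cluster(points, threshold: float) -> List[Summary]:
    """Cluster `points` in one pass; `threshold` caps each cluster's radius."""
    summaries: List[Summary] = []
    for p in points:
        # Find the summary whose centroid is closest to the incoming point.
        nearest: Optional[Summary] = min(
            summaries, key=lambda s: math.dist(s.centroid(), p), default=None
        )
        if nearest is not None and nearest.radius_if_added(p) <= threshold:
            nearest.add(p)  # absorb the point; no earlier data is re-read
        else:
            summaries.append(Summary(1, list(p), sum(x * x for x in p)))
    return summaries


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
clusters = single_scan_cluster(points, threshold=0.5)
```

Because every point is either absorbed into an existing summary or starts a new one, memory use grows with the number of summaries rather than the number of points. The result of this sketch depends on input order, the same sensitivity the abstract says the paper evaluates for BIRCH.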

DOI: 10.1145/233269.233324



Semantic Scholar estimates that this publication has 3,394 citations based on the available data.


Cite this paper

@inproceedings{Zhang1996BIRCHAE,
  title={BIRCH: An Efficient Data Clustering Method for Very Large Databases},
  author={Tian Zhang and Raghu Ramakrishnan and Miron Livny},
  booktitle={SIGMOD Conference},
  year={1996}
}