Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of <i>clusters,</i> or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named <i>BIRCH</i> (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. <i>BIRCH</i> incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). <i>BIRCH</i> can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. <i>BIRCH</i> is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate <i>BIRCH</i>'s time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of <i>BIRCH</i> versus <i>CLARANS,</i> a clustering method proposed recently for large datasets, and show that <i>BIRCH</i> is consistently superior.