Scalable Model-based Clustering by Working on Data Summaries

Huidong Jin, Man Leung Wong, Kwong-Sak Leung
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework. First, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably…
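
The two-phase idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes 1-D data, uses equal-width binning as a stand-in for the sub-cluster construction, and runs a simplified EM that evaluates each component density at the sub-cluster mean. The function names `summarize` and `em_on_summaries` and all parameters are hypothetical.

```python
import numpy as np

def summarize(X, n_bins=20):
    """Phase 1 (illustrative): bin 1-D data into equal-width cells and keep
    only the count, mean, and variance of each non-empty cell."""
    edges = np.linspace(X.min(), X.max(), n_bins + 1)
    idx = np.clip(np.digitize(X, edges) - 1, 0, n_bins - 1)
    counts, means, variances = [], [], []
    for b in range(n_bins):
        cell = X[idx == b]
        if cell.size:
            counts.append(cell.size)
            means.append(cell.mean())
            variances.append(cell.var())
    return np.array(counts, dtype=float), np.array(means), np.array(variances)

def em_on_summaries(n, m, v, k=2, n_iter=100):
    """Phase 2 (illustrative): EM for a 1-D Gaussian mixture that reads only
    the sub-cluster summaries (n, m, v), never the raw points."""
    # Assumed initialization: spread component means over quantiles of the
    # sub-cluster means and start every variance at the overall variance.
    mu = np.quantile(m, (np.arange(k) + 0.5) / k)
    total_var = np.average((m - np.average(m, weights=n)) ** 2 + v, weights=n)
    var = np.full(k, total_var)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sub-cluster,
        # evaluated at the sub-cluster mean (a simplifying assumption).
        log_p = (-0.5 * ((m[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: every sub-cluster contributes in proportion to its count;
        # its within-cell variance is added back into the component variance.
        w = r * n[:, None]
        nk = w.sum(axis=0)
        pi = nk / n.sum()
        mu = (w * m[:, None]).sum(axis=0) / nk
        var = (w * (v[:, None] + (m[:, None] - mu) ** 2)).sum(axis=0) / nk
        var = np.maximum(var, 1e-8)
    return pi, mu, var
```

In this sketch the cost of each EM iteration depends on the number of sub-clusters rather than on the number of original records, which is the property that makes working on summaries attractive for large data sets.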