How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Chris Fraley and Adrian E. Raftery, The Computer Journal.
We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parametrizations and cross-cluster constraints. Noise and outliers can be modelled by adding a Poisson process component… 
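The selection procedure the abstract sketches (fit Gaussian mixtures under different geometric parametrizations and numbers of components, then compare the fitted models) can be illustrated with scikit-learn. This is a hedged sketch using BIC for model comparison, in the spirit of Fraley and Raftery's approach but not their mclust implementation; the data below are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated Gaussian clusters in 2-D.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# Fit mixtures with 1..6 components under several covariance constraints
# (loosely analogous to the cross-cluster geometric constraints in the
# paper), then keep the (parametrization, K) pair with the lowest BIC.
best = min(
    (GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(X)
     for k in range(1, 7)
     for cov in ("spherical", "diag", "full")),
    key=lambda gm: gm.bic(X),
)
print(best.n_components)  # the three planted clusters should be recovered
```

Lower BIC is better in scikit-learn's convention, so a single `min` over all candidate fits answers both "how many clusters?" and "which model?" at once.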


Genetic Algorithms for Subset Selection in Model-Based Clustering

The problem of subset selection is recast as a model comparison problem: BIC is used to approximate Bayes factors, and the proposed criterion is the BIC difference between a candidate clustering model for the given subset and a model that assumes no clustering for the same subset.

Bayesian estimation of membership uncertainty in model‐based clustering

It is demonstrated that model‐based clustering gives much better performance for overlapping clusters, a more reliable determination of the number of clusters in data, and better identification of clustering in the presence of outliers than agglomerative hierarchical clustering or iterative relocation clustering using a K‐means criterion.


A model-based approach to cluster analysis is presented, as opposed to the mechanical classification used in deterministic clustering; it regards observations as outcomes of different distributions.

Methods for Clustering Data with Missing Values

An algorithm that utilises marginal multivariate Gaussian densities for assignment probabilities was developed and tested against more conventional model-based clustering methods for incomplete data; for cases with many observations, complete-case analysis and multiple imputation were found to have advantages over the marginal density method.

Assessment and pruning of hierarchical model based clustering

A new clustering method is proposed that can be regarded as a hybrid between model-based and nonparametric clustering; the hybrid algorithm prunes the cluster tree generated by hierarchical model-based clustering.

Combining Mixture Components for Clustering

  • J. Baudry, A. Raftery, G. Celeux, Kenneth Lo, R. Gottardo
  • Computer Science
    Journal of Computational and Graphical Statistics
  • 2010
This paper proposes first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion, which yields a unique soft clustering for each number of clusters less than or equal to K.
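The entropy criterion in this summary can be sketched in a few lines: given the n×K posterior membership matrix from a fitted mixture, merging two components amounts to summing their posterior columns, and the pair to merge is the one whose merge most reduces the entropy of the soft clustering. The names below (`ent`, `best_merge`) are illustrative, not from the paper's software.

```python
import numpy as np

def ent(tau):
    # Entropy of a soft clustering: -sum_i sum_k tau_ik * log(tau_ik),
    # treating 0 * log(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.nansum(tau * np.log(tau))

def best_merge(tau):
    # Return the pair of component indices whose merge (summing the two
    # posterior columns) gives the largest drop in entropy.
    K = tau.shape[1]
    base = ent(tau)
    pair, gain = None, -np.inf
    for j in range(K):
        for k in range(j + 1, K):
            merged = np.delete(tau, k, axis=1)
            merged[:, j] = tau[:, j] + tau[:, k]
            g = base - ent(merged)
            if g > gain:
                pair, gain = (j, k), g
    return pair, gain

# Components 0 and 1 split the same points 50/50, so merging them yields
# a deterministic clustering and the largest entropy reduction.
tau = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.5, 0.0],
                [0.0, 0.0, 1.0]])
print(best_merge(tau)[0])  # (0, 1)
```

Applied repeatedly, this produces the hierarchy of soft clusterings for each number of clusters less than or equal to K that the abstract describes.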

clusterBMA: Bayesian model averaging for clustering

Various methods have been developed to combine inference across multiple sets of results for unsupervised clustering within the ensemble and consensus clustering literature. The approach of…

On Comparing the Clustering of Regression Models Method with K-means Clustering

It is shown that the two clustering methods, CORM and K-means, can both be considered solutions to a least squares problem with missing data, but they each concern a different type of least squares.

Integrated classification likelihood for model selection in block clustering

A criterion based on an approximation of the integrated classification likelihood (ICL) of block models is developed, and a BIC-like criterion derived from this approximation is proposed.

Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion

This newly defined clustering method is aimed at overcoming the so-called "equal-size" problem associated with the k-means method, while maintaining its advantage of computational simplicity.



Inference in model-based cluster analysis

This work proposes a new approach to cluster analysis which consists of exact Bayesian inference via Gibbs sampling, and the calculation of Bayes factors from the output using the Laplace–Metropolis estimator; the approach works well in several real and simulated examples.

Robust Cluster Analysis via Mixtures of Multivariate t-Distributions

The expectation-maximization (EM) algorithm can be used to fit mixtures of multivariate t-distributions by maximum likelihood and it is demonstrated how the use of t-components provides less extreme estimates of the posterior probabilities of cluster membership.

Model-based Gaussian and non-Gaussian clustering

The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of Friedman and Rubin (1967), but it is restricted to Gaussian distributions and it does not allow for noise.
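The connection to the sum-of-squares criterion mentioned here is the standard one (stated from general knowledge, not taken from this abstract): with spherical Gaussian components of common, fixed variance and equal mixing proportions, maximizing the classification log-likelihood over hard assignments $z_i$ and means $\mu_k$ is equivalent to minimizing the k-means objective.

```latex
\max_{z,\,\mu}\ \sum_{i=1}^{n} \log \phi\!\left(x_i \mid \mu_{z_i}, \sigma^2 I\right)
\quad\Longleftrightarrow\quad
\min_{z,\,\mu}\ \sum_{i=1}^{n} \left\lVert x_i - \mu_{z_i} \right\rVert^{2}
```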

Gaussian parsimonious clustering models

Choosing the Number of Component Clusters in the Mixture-Model Using a New Informational Complexity Criterion of the Inverse-Fisher Information Matrix

The informational complexity (ICOMP) criterion based on the inverse-Fisher information matrix (IFIM) is derived and proposed as a new criterion for choosing the number of clusters in the mixture model, and the significance of ICOMP is illustrated.

Algorithms for Model-Based Gaussian Hierarchical Clustering

  • C. Fraley
  • Computer Science
    SIAM J. Sci. Comput.
  • 1998
It is shown how the structure of the Gaussian model can be exploited to yield efficient algorithms for agglomerative hierarchical clustering.

Principal Curve Clustering With Noise

The algorithm for principal curve clustering has two steps: the first is hierarchical and agglomerative (HPCC), and the second consists of iterative relocation based on the Classification EM algorithm.

9 The classification and mixture maximum likelihood approaches to cluster analysis

  • G. McLachlan
  • Mathematics
    Classification, Pattern Recognition and Reduction of Dimensionality
  • 1982

Autoclass — A Bayesian Approach to Classification

A Bayesian approach to the unsupervised discovery of classes in a set of cases, sometimes called finite mixture separation or clustering, is described; it allows direct comparison of alternate density functions that differ in the number of classes and/or the individual class density functions.