Model-Based Clustering, Discriminant Analysis, and Density Estimation

  • Chris Fraley, Adrian E. Raftery
  • Published 1 June 2002
  • Computer Science
  • Journal of the American Statistical Association, pp. 611-631
Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be… 

Methods for Clustering Data with Missing Values

An algorithm that uses marginal multivariate Gaussian densities for assignment probabilities was developed and tested against more conventional approaches to model-based clustering of incomplete data. For cases with many observations, complete-case analysis and multiple imputation were found to have advantages over the marginal density method.
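To make the idea of marginal-density assignment concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes diagonal covariance matrices, so that the marginal density over the observed coordinates is simply the product of the corresponding univariate normals. The function names, the use of `None` to mark missing values, and the two-component model are all hypothetical.

```python
import math

def marginal_log_density(x, mean, var):
    """Log density of the observed coordinates of x under a diagonal Gaussian.

    Missing coordinates are marked with None and skipped. This shortcut is
    valid only because the covariance is assumed to be diagonal.
    """
    total = 0.0
    for xi, m, v in zip(x, mean, var):
        if xi is None:
            continue
        total += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return total

def assignment_probabilities(x, weights, means, variances):
    """Posterior probability of each component given the observed part of x."""
    logs = [math.log(w) + marginal_log_density(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

# Hypothetical two-component model in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# Second coordinate missing: only the first coordinate informs the assignment.
probs_partial = assignment_probabilities([0.5, None], weights, means, variances)
# Everything missing: the assignment falls back to the mixing weights.
probs_all_missing = assignment_probabilities([None, None], weights, means, variances)
```

With a full covariance matrix, the marginal would instead use the sub-vector of the mean and the sub-matrix of the covariance that correspond to the observed coordinates.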

Recent Developments in Model-Based Clustering with Applications

Reviews the latest developments in model-based clustering, including semi-supervised clustering, non-parametric mixture modeling, choice of initialization strategies, merging of mixture components into clusters, handling of spurious solutions, and assessment of the variability of obtained partitions.

Clustering data with measurement errors

A generalized Bayes framework for probabilistic clustering

A generalized Bayes framework is proposed that bridges between these paradigms through the use of Gibbs posteriors and provides a method of uncertainty quantification for these approaches, for example by allowing calculation of the probability that a data point is well clustered.

Model-Based Clustering With Dissimilarities: A Bayesian Approach

The method carries out multidimensional scaling and model-based clustering simultaneously, and yields good object configurations and good clustering results with reasonable measures of clustering uncertainties, and can be used as a tool for dimension reduction when clustering high-dimensional objects.

Robust EM algorithm for model-based curve clustering

  • Faicel Chamroukhi
  • Computer Science
    The 2013 International Joint Conference on Neural Networks (IJCNN)
  • 2013
The approach handles both initialization and the choice of the optimal number of clusters as EM learning proceeds, rather than in a two-stage scheme, by optimizing a penalized log-likelihood criterion.

A Population Background for Nonparametric Density-Based Clustering

It is shown that only mild conditions on a sequence of density estimators are needed to ensure that the sequence of modal clusterings they induce is consistent. Two new loss functions, applicable to any clustering methodology, are presented to evaluate the performance of a data-based clustering algorithm with respect to the ideal population goal.

Fast clustering using adaptive density peak detection

This paper proposes a clustering procedure with adaptive density peak detection, in which the local density is estimated by nonparametric multivariate kernel estimation, and develops an automatic method for selecting cluster centroids by maximizing an average silhouette index.

A Bayesian Predictive Model for Clustering Data of Mixed Discrete and Continuous Type

This paper introduces a model-based approach for clustering feature vectors of mixed type, allowing each feature to simultaneously take on both categorical and real values.



Model-based clustering and data transformations for gene expression data

The model-based approach has superior performance on synthetic data sets, consistently selecting the correct model and the number of clusters, and the validity of the Gaussian mixture assumption on different transformations of real data is explored.

How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model, and the EM result provides a measure of uncertainty about the associated classification of each data point.
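The selection idea can be illustrated with a small sketch, which is not the authors' implementation. It fits one-dimensional Gaussian mixtures by EM for several numbers of components, scores each fit, and reads classification uncertainty off the EM responsibilities. Using BIC as the score (with the convention that larger is better) is an assumption of this sketch, since the summary does not name the criterion. The synthetic data, function names, variance floor, and quantile-based starting values are likewise hypothetical, and only the number of components is varied, not the covariance structure.

```python
import math
from statistics import NormalDist

def e_step(data, weights, means, variances):
    """Return per-point responsibilities and the total log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        logs = [math.log(max(w, 1e-300))
                - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp for numerical stability
        unnorm = [math.exp(l - top) for l in logs]
        total = sum(unnorm)
        loglik += top + math.log(total)
        resp.append([u / total for u in unnorm])
    return resp, loglik

def fit_mixture(data, k, iters=200, var_floor=1e-2):
    """Fit a k-component 1-D Gaussian mixture by EM from a deterministic start."""
    n = len(data)
    xs = sorted(data)
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]  # sample quantiles
    centre = sum(data) / n
    spread = max(sum((x - centre) ** 2 for x in data) / n, var_floor)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp, _unused = e_step(data, weights, means, variances)
        for j in range(k):  # M-step: weighted updates for component j
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)  # guard against collapse
    return e_step(data, weights, means, variances)

# Synthetic data: two well-separated groups built from normal quantiles.
base = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
data = base + [10.0 + q for q in base]

bic, fits = {}, {}
for k in (1, 2, 3):
    resp, loglik = fit_mixture(data, k)
    fits[k] = resp
    # Free parameters: k means, k variances, k - 1 mixing weights.
    bic[k] = 2 * loglik - (3 * k - 1) * math.log(len(data))

best_k = max(bic, key=bic.get)
resp_best = fits[best_k]
# Uncertainty of each point's classification: one minus its largest responsibility.
uncertainty = [1.0 - max(r) for r in resp_best]
```

In a multivariate setting the same loop would also range over candidate covariance structures, so that the number of clusters and the clustering model are chosen together.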

Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees

A new algorithm is presented, based on the multiresolution kd-trees of [5], which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints.

Hierarchical Model-Based Clustering for Large Datasets

This article proposes to start the hierarchical agglomeration from an efficient classification of the data in many classes rather than from the usual set of singleton clusters, and develops graphical tools that assess the presence of clusters in the data and uncover observations difficult to classify.

Inference in model-based cluster analysis

This work proposes a new approach to cluster analysis which consists of exact Bayesian inference via Gibbs sampling, and the calculation of Bayes factors from the output using the Laplace–Metropolis estimator, which works well in several real and simulated examples.

Finding Curvilinear Features in Spatial Point Patterns: Principal Curve Clustering with Noise

The algorithm for principal curve clustering has two steps: the first is hierarchical and agglomerative (HPCC), and the second consists of iterative relocation based on the classification EM algorithm (CEM-PCC), which is used to combine potential feature clusters, refine the results, and deal with background noise.

Gaussian parsimonious clustering models

Probabilistic models in cluster analysis

Model selection for probabilistic clustering using cross-validated likelihood

The cross-validation approach, as well as penalized likelihood and McLachlan's bootstrap method, are applied to two data sets and the results from all three methods are in close agreement.

Principal component analysis for clustering gene expression data

The empirical study showed that clustering with the principal components (PCs) instead of the original variables does not necessarily improve, and often degrades, cluster quality. The authors do not recommend PCA before clustering except in special circumstances.