Choosing the number of clusters in K-means clustering.

@article{Steinley2011ChoosingTN,
  title={Choosing the number of clusters in K-means clustering},
  author={D. Steinley and M. Brusco},
  journal={Psychological Methods},
  year={2011},
  volume={16},
  number={3},
  pages={285--297}
}
Steinley (2007) provided a lower bound for the sum-of-squares error criterion function used in K-means clustering. In this article, on the basis of the lower bound, the authors propose a method to distinguish between 1 cluster (i.e., a single distribution) versus more than 1 cluster. Additionally, conditional on indicating there are multiple clusters, the procedure is extended to determine the number of clusters. Through a series of simulations, the proposed methodology is shown to outperform…
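The bound-based test itself is developed in the article; as a rough, hedged illustration of the quantity it operates on, the Python sketch below (using scikit-learn; the toy data, the candidate range, and the SSE-ratio printout at the end are illustrative assumptions, not the authors' procedure) scans candidate values of k and records the sum-of-squares error that the lower bound would be compared against.

# Illustrative sketch only: compute the K-means sum-of-squares error (SSE)
# for a range of candidate k. This is the quantity that the article's
# lower-bound test evaluates; the ratio printout below is NOT that test.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs in four dimensions.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(4, 1, (100, 4))])

sse = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # within-cluster sum-of-squares error

# A large relative drop from SSE(1) suggests more than one cluster;
# the article replaces this informal reading with a formal comparison
# of the observed SSE to its theoretical lower bound.
for k, v in sse.items():
    print(f"k={k}: SSE={v:.1f}, SSE(k)/SSE(1)={v / sse[1]:.3f}")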

Citations

Estimating the number of clusters in a dataset via consensus clustering
In unsupervised learning, the problem of finding the appropriate number of clusters, usually denoted as k, is very challenging. Its importance lies in the fact that k is a vital…
A stability analysis of sparse K-means
TLDR: The proposed methodology allows researchers to use cluster analysis for disease phenotyping and subgroup discovery with confidence that they are uncovering accurate and stable results, ensuring that their findings can support reliable public health decisions.
Self-Adjusting Variable Neighborhood Search Algorithm for Near-Optimal k-Means Clustering
TLDR: This article investigates how the most important parameter of the randomized neighborhoods formed by greedy agglomerative procedures influences the computational efficiency of VNS algorithms, and proposes a new VNS-based solver, implemented on the graphics processing unit (GPU), that adjusts this parameter.
Weighting Policies for Robust Unsupervised Ensemble Learning
TLDR: This study shows that, using ideas from Markowitz portfolio theory, the proposed weighted consensus clustering creates a partition with less variation than traditional consensus clustering, and aims to optimize the combination of individual clustering methods to minimize the variance of clustering accuracy.
Strict Monotonicity of Sum of Squares Error and Normalized Cut in the Lattice of Clusterings
TLDR: Monotonicity is studied not just on the minimizers but on the entire clustering lattice, showing that the value of the sum-of-squares error is strictly monotone under the strict refinement relation of clusterings; data-dependent bounds are obtained on the difference between the value of a clustering and one of its refinements (a numerical check of this monotonicity appears after this list).
Finding the Number of Clusters in Data and Better Initial Centers for K-means Algorithm
K-means is the most well-known algorithm for data clustering in data mining. Its simplicity and speed of convergence to local minima are its most important advantages, in addition to its…
K-Means Clustering and Mixture Model Clustering: Reply to McLachlan (2011) and Vermunt (2011)
McLachlan (2011) and Vermunt (2011) each provided thoughtful replies to our original article (Steinley & Brusco, 2011). This response serves to incorporate some of their comments while simultaneously…
The δ-Machine: Classification Based on Distances Towards Prototypes
TLDR: The properties of the δ-machine are discussed, an automatic decision rule for choosing the number of clusters for the K-means method from a predictive perspective is proposed, and variable importance measures and partial dependence plots for the δ-machine are derived.
Gaussian model-based partitioning using iterated local search
TLDR: A comparison using 23 data sets from the classification literature revealed that the ILS and hybrid heuristics generally provided better criterion function values than the multistart approach when all three methods were constrained to the same 10-minute time limit.
Examining the effect of initialization strategies on the performance of Gaussian mixture modeling
TLDR: Five techniques for obtaining starting values, as implemented in popular software packages, are compared, and a set of recommendations for selecting starting values is provided to the user (a minimal comparison sketch follows this list).
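As a small, self-contained check of the monotonicity result cited above (an illustrative example, not the paper's proof or its bounds), the Python snippet below computes the sum-of-squares error of a one-cluster partition and of a strict refinement of it, confirming that refinement lowers the error.

# Numerical illustration: splitting a cluster (a strict refinement)
# strictly lowers the sum-of-squares error on this toy data set.
import numpy as np

def sse(X, labels):
    # Sum of squared distances from each point to its cluster centroid.
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

X = np.array([[0.0], [1.0], [9.0], [10.0]])
coarse = np.array([0, 0, 0, 0])  # one cluster containing all points
fine = np.array([0, 0, 1, 1])    # a strict refinement into two clusters
print(sse(X, coarse), ">", sse(X, fine))  # prints: 82.0 > 1.0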
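The sketch below illustrates the kind of starting-value comparison described in the last entry above, using two of the initialization strategies available in scikit-learn's GaussianMixture ('kmeans' and 'random'); the data and the choice of strategies are assumptions for illustration, not the five techniques or the software packages examined in that paper.

# Minimal sketch: compare two starting-value strategies for Gaussian
# mixture modeling by the average log-likelihood each one attains.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(3, 1, (150, 2))])

for init in ("kmeans", "random"):
    gm = GaussianMixture(n_components=2, init_params=init,
                         n_init=5, random_state=0).fit(X)
    # Higher average log-likelihood indicates a better local optimum.
    print(f"init_params={init!r}: avg log-likelihood={gm.score(X):.4f}")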

References

Showing 1-10 of 39 references.
Validating Clusters with the Lower Bound for Sum-of-Squares Error
Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is…
An examination of procedures for determining the number of clusters in a data set
TLDR: The aim of this paper is to compare three methods based on the hypervolume criterion with four other well-known methods for determining the number of clusters on artificial data sets.
Profiling local optima in K-means clustering: developing a diagnostic technique.
D. Steinley. Psychological Methods, 2006.
TLDR: By combining the information from several observable characteristics of the data (number of clusters, number of variables, sample size, etc.) with the prevalence of unique local optima in several thousand implementations of the K-means algorithm, the author provides a method capable of guiding key data-analysis decisions.
Model-based Gaussian and non-Gaussian clustering
The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum-of-squares criterion and on the…
A cautionary note on using internal cross validation to select the number of clusters
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest…
K-means clustering: a half-century synthesis.
D. Steinley. The British Journal of Mathematical and Statistical Psychology, 2006.
TLDR: This paper synthesizes the results, methodology, and research conducted concerning the K-means clustering method over the last fifty years, leading to a unifying treatment of K-means and some of its extensions.
Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures
Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have…
Evaluating mixture modeling for clustering: recommendations and cautions.
TLDR: Focus is given to the multivariate normal distribution; nine separate decompositions (i.e., class structures) of the covariance matrix are investigated, and degraded performance is observed for both K-means clustering and mixture-model clustering.
An algorithm for generating artificial test clusters
An algorithm for generating artificial data sets which contain distinct nonoverlapping clusters is presented. The algorithm is useful for generating test data sets for Monte Carlo validation research…
A Nearest-Centroid Technique for Evaluating the Minimum-Variance Clustering Procedure.
It was posited that a good cluster solution has two characteristics: (1) it is stable across multiple random samples; and (2) its clusters accurately correspond to the populations from which the…