Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

@article{Huang2004ExtensionsTT,
  title={Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values},
  author={Joshua Zhexue Huang},
  journal={Data Mining and Knowledge Discovery},
  year={2004},
  volume={2},
  pages={283--304}
}
  • J. Huang
  • Published 2004
  • Mathematics, Computer Science
  • Data Mining and Knowledge Discovery
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters…
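As a rough illustration of the k-modes idea summarized above, the following Python sketch uses simple matching dissimilarity (the number of attributes on which two records differ) and represents each cluster by its per-attribute most frequent value instead of a mean. The function names, the batch-style update loop, and the caller-supplied initial modes are assumptions made for this sketch; the abstract is truncated here, so this should not be read as the paper's exact procedure.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return the most frequent value of each attribute in a non-empty cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def k_modes(records, initial_modes, max_iter=20):
    """Batch-style k-modes sketch (assumption: initial modes are given by the caller).

    Alternates between assigning each record to its nearest mode and
    recomputing each mode from the records assigned to it.
    """
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = cluster_mode(members)
    return labels, modes
```

Calling `k_modes(records, [records[0], records[3]])` on a list of equal-length tuples returns one cluster label per record and the final modes. Ties between equally frequent values are broken by first occurrence, which is a choice of this sketch only.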
Cluster Analysis on Different Data Sets Using K-Modes and K-Prototype Algorithms
The k-means algorithm is well-known for its efficiency in clustering large data sets and it is restricted to the numerical data types. But the real world is a mixture of various data typed objects.
A method for k-means-like clustering of categorical data
TLDR
A novel extension of k-means method for clustering categorical data is developed, making use of an information theoretic-based dissimilarity measure and a kernel-based method for representation of cluster means for categorical objects.
An alternative extension of the k-means algorithm for clustering categorical data
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points.
Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency
TLDR
The proposed approach is compared with a previous method based on transforming the categorical datasets into binary values and shows that the proposed method outperforms the binary method in all cases.
An iterative initial-points refinement algorithm for categorical data clustering
TLDR
Experiments show that the k-modes clustering algorithm using refined initial points leads to higher precision results much more reliably than the random selection method without refinement, thus making the refinement process applicable to many data mining applications with categorical data.
Comparing K-Value Estimation for Categorical and Numeric Data Clustring
TLDR
Heuristic novel techniques are used for conversion and comparing the categorical data with numeric data and the Gmeans algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution.
A k-Means-Like Algorithm for Clustering Categorical Data Using an Information Theoretic-Based Dissimilarity Measure
TLDR
A new dissimilarity measure based on an information theoretic definition of similarity that considers the amount of information of two values in the domain set is proposed that automatically measures the contribution of individual attributes for the clusters.
Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes
TLDR
An improved k-prototypes algorithm is proposed in this paper, which employs a new dissimilarity measure for incomplete data sets with mixed numeric and categorical attributes and a new approach to select k objects as the initial prototypes based on the nearest neighbors.
An improved k-prototypes clustering algorithm for mixed numeric and categorical data
TLDR
An improved k-prototypes algorithm to cluster mixed data is proposed, and a new measure to calculate the dissimilarity between data objects and prototypes of clusters is proposed that takes into account the significance of different attributes towards the clustering process.
A dissimilarity measure for the k-Modes clustering algorithm
TLDR
The results of comparative experiments show the effectiveness of the new dissimilarity measure for the k-Modes algorithm, especially on data sets with biological and genetic taxonomy information, and indicate that it can be effectively used for large data sets.

References

A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
TLDR
This paper presents an algorithm, called k-modes, to extend the k-means paradigm to categorical domains, which introduces new dissimilarity measures to deal with categorical objects, replaces means of clusters with modes, and uses a frequency based method to update modes in the clustering process to minimise the clustering cost function.
CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES
TLDR
A k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency, and uses decision tree induction algorithms to create rules for clusters.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
TLDR
DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.
BIRCH: an efficient data clustering method for very large databases
TLDR
A data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is presented, and it is demonstrated that it is especially suitable for very large databases.
c-Means Clustering with the l1 and l∞ Norms
An extension of the hard and fuzzy c-means (HCM/FCM) clustering algorithms is described. Specifically, these models are extended to admit the case where the (dis)similarity measure on pairs…
A conceptual version of the K-means algorithm
TLDR
A hybrid numeric-symbolic method that integrates an extended version of the K-means algorithm for cluster determination and a complementary conceptual characterization algorithm for cluster description is proposed.
Symbolic clustering using a new dissimilarity measure
TLDR
A new dissimilarity measure, based on “position”, “span” and “content” of symbolic objects is proposed for symbolic clustering, and the results of the application of the algorithm on numeric data of known number of classes are described first to show the efficacy of the method.
An examination of procedures for determining the number of clusters in a data set
A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To…
An algorithm for generating artificial test clusters
An algorithm for generating artificial data sets which contain distinct nonoverlapping clusters is presented. The algorithm is useful for generating test data sets for Monte Carlo validation research…
Validity studies in clustering methodologies
TLDR
This paper provides a semi-tutorial review of the state-of-the-art in cluster validity, or the verification of results from clustering algorithms, and covers ways of measuring clustering tendency, the fit of hierarchical and partitional structures and indices of compactness and isolation for individual clusters.