Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

@article{Huang2004ExtensionsTT,
  title={Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values},
  author={J. Huang},
  journal={Data Mining and Knowledge Discovery},
  year={2004},
  volume={2},
  pages={283--304}
}
  • J. Huang
  • Published 2004
  • Mathematics, Computer Science
  • Data Mining and Knowledge Discovery
  • The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters…
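
The simple matching dissimilarity named in the abstract can be illustrated with a short Python sketch. This is not code from the paper: the function names `matching_dissimilarity` and `column_modes`, and the toy records, are hypothetical. The abstract is cut off before it says what replaces the cluster means, so using the per-attribute mode as a cluster representative is an assumption suggested by the name "k-modes".

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each attribute.

    Assumption: this per-attribute mode stands in for the cluster mean.
    Ties are resolved in favour of the value seen first.
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


# Hypothetical toy data: three records with three categorical attributes.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]

center = column_modes(records)
distances = [matching_dissimilarity(r, center) for r in records]
```

With this data the representative agrees with the first record on every attribute, so that record is at distance zero, and each other record's distance is the number of attributes on which it departs from the representative.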
    1,964 Citations
    Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency
    • 5
    Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes
    • 7
    A dissimilarity measure for the k-Modes clustering algorithm
    • 95

    References

    SHOWING 1-10 OF 41 REFERENCES
    A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
    • 534
    CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES
    • 451
    • Highly Influential
    A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
    • 15,128
    BIRCH: an efficient data clustering method for very large databases
    • 4,581
    c-Means Clustering with the ℓ1 and ℓ∞ Norms
    • 40
    • Highly Influential
    A conceptual version of the K-means algorithm
    • 233
    Symbolic clustering using a new dissimilarity measure
    • 316
    An examination of procedures for determining the number of clusters in a data set
    • 2,720
    An algorithm for generating artificial test clusters
    • 146
    Validity studies in clustering methodologies
    • 308