An efficient k‐modes algorithm for clustering categorical datasets

@article{Dorman2022AnEK,
  title={An efficient k‐modes algorithm for clustering categorical datasets},
  author={Karin S. Dorman and Ranjan Maitra},
  journal={Statistical Analysis and Data Mining: The ASA Data Science Journal},
  year={2022},
  volume={15},
  pages={83--97}
}
  • K. Dorman, R. Maitra
  • Published 6 June 2020
  • Computer Science
  • Statistical Analysis and Data Mining: The ASA Data Science Journal
Mining clusters from data is an important endeavor in many applications. The k‐means method is a popular, efficient, and distribution‐free approach for clustering numerical‐valued data, but does not apply for categorical‐valued observations. The k‐modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the k‐means objective function. We provide a novel, computationally efficient implementation of k‐modes, called Optimal Transfer… 
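The substitution described in the abstract (Hamming distance in place of Euclidean distance, modes in place of means) can be illustrated with a minimal sketch. This is a generic Lloyd-style alternation written for illustration only; it is not the Optimal Transfer implementation the paper introduces. The function names (`hamming`, `column_modes`, `k_modes`) and the toy data are assumptions made for this example.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent category in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Alternate between nearest-mode assignment and mode updates (illustrative sketch)."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster whose mode is closest in Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change either
        labels = new_labels
        # Update step: each cluster's mode becomes the column-wise most frequent category.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # k-modes objective: total Hamming distance from each record to its cluster mode.
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy data: six records with three categorical attributes each.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes, cost = k_modes(data, init_modes=[data[0], data[3]])
print(labels, modes, cost)
```

As with k-means, the result of such an alternation depends on the initial modes, so in practice it would typically be run from several initializations and the lowest-cost partition kept.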

US House Price Prediction Using Two-Stage k-Means Clustering

TLDR
The two-stage k-means clustering method can improve the accuracy and stability of prediction simultaneously, and the technique remains practical when the size of the data is not sufficient.

A Multicluster Approach to Selecting Initial Sets for Clustering of Categorical Data

TLDR
A multicluster approach to selecting initial sets for clustering categorical data that has greater precision and a better grouping effect than the classical k-modes algorithm.

Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering

TLDR
A distribution-free, fully automated syncytial clustering algorithm that can be used with $k$-means and other algorithms and that is always a top performer in identifying groups with regular and irregular structures in several datasets.

References


On comparing partitions

Rand (1971) proposed the Rand Index to measure the stability of two partitions of one set of units. Hubert and Arabie (1985) corrected the Rand Index for chance (Adjusted Rand Index). In this paper,

A k-means clustering algorithm

K-modes Clustering

TLDR
It is conjectured that, although latent class procedures might perform better than K-modes in some cases, K-modes could outperform latent class procedures in other cases, and it is recommended that these two approaches be used as "complementary" procedures in performing cluster analysis.

Initialization of K-modes clustering using outlier detection techniques

On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm

TLDR
The main aim of this paper is to rigorously derive the updating formula of the k-modes clustering algorithm with the new dissimilarity measure and the convergence of the algorithm under the optimization framework.

Model-Based Clustering and Classification for Data Science: With Applications in R

  • S. Shin
  • Computer Science, Mathematics
  • 2020
TLDR
In statistics, data are not regarded as just numbers, but realizations of random elements, so data analysis is essentially the process of uncovering the data generating random elements.

Interpoint Distance Comparisons in Correspondence Analysis

Correspondence analysis is a metric technique for finding a spatial representation of data that has particular applicability to the analysis of cross tabulations (or contingency tables). The authors
...