Clustering Categorical Data via Ensembling Dissimilarity Matrices

@article{Amiri2015ClusteringCD,
  title={Clustering Categorical Data via Ensembling Dissimilarity Matrices},
  author={Saeid Amiri and Bertrand S. Clarke and Jennifer Clarke},
  journal={Journal of Computational and Graphical Statistics},
  year={2015},
  volume={27},
  pages={195--208}
}
ABSTRACT We present a technique for clustering categorical data by generating many dissimilarity matrices and combining them. We begin by demonstrating our technique on low-dimensional categorical data and comparing it to several other techniques that have been proposed. We show through simulations and examples that our method is both more accurate and more stable. Then we give conditions under which our method should yield good results in general. Our method extends to high-dimensional… 
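The abstract's core idea, generating many dissimilarity matrices and combining them, can be illustrated with a minimal sketch. This is not the authors' procedure. As assumptions made purely for illustration, each matrix is a simple-matching (Hamming) dissimilarity computed on a random subset of features, and the matrices are combined by averaging. The function names `hamming_dissimilarity` and `ensemble_dissimilarity`, the parameters `n_matrices`, `subset_size`, and `seed`, and the toy data are all hypothetical.

```python
import random


def hamming_dissimilarity(rows, features):
    """Fraction of the chosen features on which each pair of rows disagrees."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mismatches = sum(rows[i][f] != rows[j][f] for f in features)
            d[i][j] = d[j][i] = mismatches / len(features)
    return d


def ensemble_dissimilarity(rows, n_matrices=50, subset_size=2, seed=0):
    """Average many dissimilarity matrices, each built on a random feature subset.

    Assumption: feature subsampling and plain averaging are illustrative
    choices, not the method described in the paper.
    """
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_matrices):
        features = rng.sample(range(p), subset_size)
        d = hamming_dissimilarity(rows, features)
        for i in range(n):
            for j in range(n):
                total[i][j] += d[i][j]
    return [[value / n_matrices for value in row] for row in total]


# Toy categorical data: rows 2 and 3 are identical, rows 0 and 2 share no value.
data = [
    ("a", "x", "red"),
    ("a", "x", "blue"),
    ("b", "y", "green"),
    ("b", "y", "green"),
]
ens = ensemble_dissimilarity(data)
```

The combined matrix `ens` could then be passed to any clustering routine that accepts a precomputed dissimilarity matrix, such as hierarchical clustering.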
EnsCat: clustering of categorical data via ensembling
TLDR
Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data.
Optimal transport, mean partition, and uncertainty assessment in cluster analysis
TLDR
A new algorithm to enhance clustering by any baseline method using bootstrap samples is proposed, and a covering point set for each cluster, a concept akin to the confidence interval, is proposed to address the crucial question of whether any cluster is an intrinsic or spurious pattern.
A new method for weighted ensemble clustering and coupled ensemble selection
TLDR
Clustering ensemble, also referred to as consensus clustering, has emerged as a method of combining an ensemble of different clusterings to derive a final clustering that is of better quality and more consistent with known clustering techniques.
Machine Learning Random Forest Cluster Analysis for Large Overfitting Data: using R Programming
  • Yagyanath Rimal
  • Computer Science
    2019 6th International Conference on Computing for Sustainable Global Development (INDIACom)
  • 2019
This review article clearly discusses machine learning random forest clustering analysis for large over fitted data using R Programming which has been sufficiently explained with sampled data to
Bootstrap ClustGeo with spatial constraints
The aim of this paper is to introduce a new statistical procedure for clustering spatial data when a high number of covariates is considered. In particular, this procedure is obtained by coupling
Cluster Analysis of Mixed and Missing Chronic Kidney Disease Data in KwaZulu-Natal Province, South Africa
TLDR
The results show that advanced imputation methods like multiple imputation, which take into consideration the uncertainty inherent in imputations, should be explored when clustering missing datasets, and that the Ahmad-Dey distance measure consistently outperformed Gower’s distance on the mixed and missing dataset.
Search for relevant subsets of binary predictors in high dimensional regression for discovering the lead molecule
TLDR
This work studies the relationship between molecular properties and its fragment composition by building a regression model, in which predictors, represented by binary variables indicating the presence or absence of fragments, are grouped in subsets and a bi‐level penalization term is introduced for the high dimensionality of the problem.
Detecting organized eCommerce fraud using scalable categorical clustering
TLDR
A novel solution to detect organized fraud by analyzing orders in bulk based on clustering that detects 26.2% of fraud while raising false alarms for only 0.1% of legitimate orders is proposed.

References

ROCK: a robust clustering algorithm for categorical attributes
TLDR
This work develops a robust hierarchical clustering algorithm, ROCK, that employs links and not distances when merging clusters, and shows that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
Clustering Categorical Data Based on Distance Vectors
TLDR
Comparisons with two well-known clustering algorithms, K-modes and AutoClass, show that the proposed algorithm substantially outperforms these competitors, with the classification rate or the information gain typically improved by several orders of magnitude.
A Link-Based Cluster Ensemble Approach for Categorical Data Clustering
TLDR
Experimental results suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
A General Hybrid Clustering Technique
TLDR
A hybrid clustering technique is used to produce a series of clusterings of various sizes and the key step in this stage is to find a K-means clustering using clusters where and then join these small clusters by using single linkage clusters.
Hierarchical Density-Based Clustering of Categorical Data and a Simplification
TLDR
The HIERDENC algorithm for hierarchical density-based clustering of categorical data offers a basis for designing simpler clustering algorithms that balance the tradeoff of accuracy and speed, and a faster simplification of HIERDENC called the MULIC algorithm is presented.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • J. Huang
  • Computer Science
    Data Mining and Knowledge Discovery
  • 2004
TLDR
Two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values are presented and are shown to be efficient when clustering large data sets, which is critical to data mining applications.
A novel attribute weighting algorithm for clustering high-dimensional categorical data
CACTUS—clustering categorical data using summaries
TLDR
This paper introduces a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes and describes a very fast summarization-based algorithm called CACTUS that discovers exactly such clusters in the data.
A cluster centers initialization method for clustering categorical data