Corpus ID: 6620859

A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining

@inproceedings{Huang1997AFC,
  title={A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining},
  author={J. Huang},
  booktitle={DMKD},
  year={1997}
}
  • J. Huang
  • Published in DMKD 1997
  • Computer Science
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. [...] We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated very good classification performance. Experiments on a very large health…
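The method summarised in the abstract can be illustrated with a minimal sketch. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and recomputes every mode once per pass from attribute-value frequencies, which is a simplification of the per-allocation update the abstract describes. The function names, the `initial_modes` parameter, and the toy records are all hypothetical and not taken from the paper.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Most frequent category of each attribute across the given records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def k_modes(records, initial_modes, max_iter=100):
    """Batch k-modes: assign records to the nearest mode, then refresh modes.

    Assumption: modes are recomputed once per pass, not after every
    single allocation, and empty clusters keep their previous mode.
    """
    modes = [tuple(m) for m in initial_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record goes to the closest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: frequency-based mode of each non-empty cluster.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = cluster_mode(members)
    # Clustering cost: total dissimilarity of records to their cluster mode.
    cost = sum(matching_dissimilarity(r, modes[lab]) for r, lab in zip(records, labels))
    return labels, modes, cost


# Invented toy data: (colour, size, shape)
toy_records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
toy_labels, toy_modes, toy_cost = k_modes(toy_records, [toy_records[0], toy_records[3]])
print(toy_labels, toy_modes, toy_cost)
```

Passing the initial modes explicitly keeps the sketch deterministic; in practice the result depends on how the initial modes are chosen.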
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • J. Huang
  • Mathematics, Computer Science
  • Data Mining and Knowledge Discovery
  • 2004
TLDR
Two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values are presented and are shown to be efficient when clustering large data sets, which is critical to data mining applications.
An Efficient Modified K-Means Algorithm To Cluster Large Data-set In Data Mining
TLDR
An efficient modified K-means clustering algorithm for large data sets is proposed; its objective is to find cluster centers that are very close to the final solution at each iterative step.
Applications of clustering algorithms and self organizing maps as data mining and business intelligence tools on real world data sets
TLDR
Comparisons among some nonhierarchical and hierarchical clustering algorithms, including SOM (Self-Organizing Map) neural network methods, show that SOM clustering, relative to the k-means and hierarchical clustering algorithms, is scalable in terms of both the number of clusters and the number of records.
K-Distributions: A New Algorithm for Clustering Categorical Data
TLDR
A new algorithm called K-distributions is presented, which significantly outperforms K-modes in terms of clustering accuracy and log likelihood and is presented as a stand-alone algorithm for categorical domains.
Design and analysis of clustering algorithms for numerical, categorical and mixed data
TLDR
The purpose of this research is to design and analyse clustering algorithms for numerical, categorical and mixed data sets, and a main part of this thesis is devoted to normalisation.
A matching based clustering algorithm for categorical data
TLDR
A new framework for partitioning categorical data, which does not use the distance measure as a key concept, is presented, and the matching-based clustering algorithm is designed based on the similarity matrix and a framework for updating the latter using feature importance criteria.
A novel attribute weighting algorithm for clustering high-dimensional categorical data
TLDR
A novel weighting technique for categorical data is developed to calculate two weights for each attribute (or dimension) in each cluster and use the weight values to identify the subsets of important attributes that categorize different clusters.
On the Consequence of Variation Measure in K-modes Clustering Algorithm
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. Clustering is one of the most important data mining techniques that partitions data…
Categorical Data Clustering Using the Combinations of Attribute Values
TLDR
A new clustering algorithm for categorical data that is based on the frequency of attribute value combinations is proposed, which finds all the combinations of attribute values in a record, which represent a subset of all the attribute values, and then groups the records using the frequency of these combinations.
A New Clustering Algorithm of Hybrid Data According to Weights of Attributes
TLDR
This paper introduces an algorithm which has been improved for the clustering of large hybrid data in an effective way that also includes the weights of attributes, mainly based on the K-Prototypes algorithm.

References

SHOWING 1-10 OF 29 REFERENCES
Clustering Large Data Sets with Mixed Numeric and Categorical Values
TLDR
A k-prototypes algorithm is presented which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency, and which uses decision tree induction algorithms to create rules for clusters.
Symbolic clustering using a new dissimilarity measure
TLDR
A new dissimilarity measure, based on “position”, “span” and “content” of symbolic objects, is proposed for symbolic clustering, and the results of the application of the algorithm on numeric data of known number of classes are described first to show the efficacy of the method.
Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy
  • R. Michalski, R. Stepp
  • Computer Science, Medicine
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1983
TLDR
A method for automated construction of classifications called conceptual clustering is described and compared to methods used in numerical taxonomy, in which descriptive concepts are conjunctive statements involving relations on selected object attributes and optimized according to an assumed global criterion of clustering quality.
A conceptual version of the K-means algorithm
TLDR
A hybrid numeric-symbolic method that integrates an extended version of the K-means algorithm for cluster determination and a complementary conceptual characterization algorithm for cluster description is proposed.
A clustering technique for summarizing multivariate data.
TLDR
A practical computing method termed ISODATA, which finds the cluster structure of such data, is described and provides a fit to the data of a set of cluster centers that tends to minimize the sum of the squared distances of each data point from its closest cluster center.
SPRINT: A Scalable Parallel Classifier for Data Mining
TLDR
A new decision-tree-based classification algorithm, called SPRINT, is presented that removes all of the memory restrictions, and is fast and scalable, and designed to be easily parallelized, allowing many processors to work together to build a single consistent model.
Some methods for classification and analysis of multivariate observations
The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give…
New experimental results in fuzzy clustering
  • E. Ruspini
  • Mathematics, Computer Science
  • Inf. Sci.
  • 1973
TLDR
The modifications presented here resulted in good fuzzy classifications for any previously established number of clusters, using improved techniques for clustering data in fuzzy sets.
c-means clustering with the $l_1$ and $l_\infty$ norms
An extension of the hard and fuzzy c-means (HCM/FCM) clustering algorithms is described. Specifically, these models are extended to admit the case where the (dis)similarity measure on pairs of…
A new approach to clustering
TLDR
Estimation theory is used to derive a new approach to the clustering problem: a unification of centroid and mode estimation, achieved by considering the effect of spatial scale on the estimator. The result is a multiresolution method that spans a range of spatial scales.