• Corpus ID: 3007488

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES

@inproceedings{Huang1997CLUSTERINGLD,
  title={CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES},
  author={Zhexue Huang},
  year={1997}
}
Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. [...] In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data, the algorithm is identical to k-means. To assist interpretation of clusters, we use decision tree induction algorithms to create rules for clusters. These rules, together with…
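The abstract's idea of clustering objects against k prototypes can be illustrated with a minimal sketch. It assumes the commonly cited k-prototypes cost: squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches. The function names, the `gamma` default, and the batch (k-means-style) prototype update are illustrative assumptions; the abstract describes a dynamic update of prototypes, which this sketch simplifies.

```python
import random
from collections import Counter


def dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Assumed cost: squared Euclidean distance on numeric attributes
    plus gamma times the count of mismatching categorical attributes."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    mismatches = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * mismatches


def k_prototypes(num, cat, k, gamma=1.0, max_iter=20, seed=0):
    """Batch sketch: assign every object to its nearest prototype, then
    recompute each prototype as (numeric means, categorical modes)."""
    rng = random.Random(seed)
    n = len(num)
    # Hypothetical initialisation: k distinct objects chosen at random.
    protos = [(list(num[i]), list(cat[i])) for i in rng.sample(range(n), k)]
    labels = [None] * n
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(
                num[i], cat[i], protos[j][0], protos[j][1], gamma))
            for i in range(n)
        ]
        if new_labels == labels:
            break  # assignments are stable, so prototypes will not change
        labels = new_labels
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            means = [sum(num[i][d] for i in members) / len(members)
                     for d in range(len(num[0]))]
            modes = [Counter(cat[i][d] for i in members).most_common(1)[0][0]
                     for d in range(len(cat[0]))]
            protos[j] = (means, modes)
    return labels, protos
```

With no categorical attributes (each row of `cat` an empty tuple), the mismatch term vanishes and the loop reduces to ordinary k-means, which matches the abstract's remark about numeric data.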
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
TLDR
This paper presents an algorithm, called k-modes, that extends the k-means paradigm to categorical domains: it introduces new dissimilarity measures to deal with categorical objects, replaces cluster means with modes, and uses a frequency-based method to update modes during clustering to minimise the clustering cost function.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • J. Huang
  • Mathematics, Computer Science
    Data Mining and Knowledge Discovery
  • 2004
TLDR
Two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values are presented and are shown to be efficient when clustering large data sets, which is critical to data mining applications.
Design and analysis of clustering algorithms for numerical, categorical and mixed data
TLDR
The purpose of this research is to design and analyse clustering algorithms for numerical, categorical and mixed data sets, and a main part of this thesis is devoted to normalisation.
An iterative initial-points refinement algorithm for categorical data clustering
TLDR
Experiments show that the k-modes clustering algorithm using refined initial points leads to higher precision results much more reliably than the random selection method without refinement, thus making the refinement process applicable to many data mining applications with categorical data.
A New Clustering Algorithm of Hybrid Data According to Weights of Attributes
TLDR
This paper introduces an algorithm which has been improved for the clustering of large hybrid data in an effective way that also includes the weights of attributes, mainly based on the K-Prototypes algorithm.
An improved k-prototypes clustering algorithm for mixed numeric and categorical data
TLDR
An improved k-prototypes algorithm to cluster mixed data is proposed, and a new measure to calculate the dissimilarity between data objects and prototypes of clusters is proposed that takes into account the significance of different attributes towards the clustering process.
Modified K-Means Algorithm for Effective Clustering of Categorical Data Sets
The traditional k-means algorithm is well known for its clustering ability and efficiency on large data sets. However, this method is well suited to numeric values only and cannot be effectively
An alternative extension of the k-means algorithm for clustering categorical data
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points.
Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes
TLDR
An improved k-prototypes algorithm is proposed in this paper, which employs a new dissimilarity measure for incomplete data sets with mixed numeric and categorical attributes and a new approach to selecting k objects as the initial prototypes based on the nearest neighbors.
Integrated Framework Using Frequent Pattern for Clustering Numeric and Nominal Data Sets
TLDR
An integrated framework using frequent pattern analysis, the frequent pattern-based framework for mixed data clustering (FPMC) algorithm, is proposed to cluster mixed data in a competent way by performing a one-time clustering along with attribute reduction.

References

Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy
  • R. Michalski, R. Stepp
  • Computer Science, Medicine
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1983
TLDR
A method for automated construction of classifications called conceptual clustering is described and compared to methods used in numerical taxonomy, in which descriptive concepts are conjunctive statements involving relations on selected object attributes and optimized according to an assumed global criterion of clustering quality.
Cluster analysis
TLDR
This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Some methods for classification and analysis of multivariate observations
The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give
c-means clustering with the l_1 and l_∞ norms
An extension of the hard and fuzzy c-means (HCM/FCM) clustering algorithms is described. Specifically, these models are extended to admit the case where the (dis)similarity measure on pairs of
C4.5: Programs for Machine Learning
TLDR
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
A non-greedy approach to tree-structured clustering
We propose a new interdisciplinary approach for the hard optimization problem of tree-structured clustering, wherein the imposition of structural constraints on the solution drastically
A deterministic annealing approach to clustering
TLDR
It is shown that as the temperature approaches zero, the algorithm becomes the basic ISODATA algorithm and the method is independent of the initial choice of cluster means.
Programs for Machine Learning
TLDR
In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
A General Coefficient of Similarity and Some of Its Properties
A general coefficient measuring the similarity between two sampling units is defined. The matrix of similarities between all pairs of sample units is shown to be positive semidefinite (except
Discrimination and Classification
  • D. Hand
  • Engineering, Computer Science
  • 1981
Presents different approaches to discrimination and classification problems from a statistical perspective. Provides computer projects concentrating on the most widely used and important algorithms,