• Corpus ID: 3007488


• Author: Zhexue Huang
Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. […] In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data, the algorithm is identical to k-means. To assist interpretation of clusters, we use decision tree induction algorithms to create rules for clusters. These rules, together with…
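The idea of clustering objects against k prototypes can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions that the abstract does not state: the dissimilarity is taken to be squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches, each prototype holds numeric means and categorical modes, and prototypes are updated in batch after every full assignment pass rather than dynamically. The names `mixed_dissimilarity`, `update_prototype`, `k_prototypes`, and `init_idx` are hypothetical.

```python
from collections import Counter


def mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Assumed measure: squared Euclidean distance on the numeric part
    plus gamma times the count of mismatching categorical values."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * categorical


def update_prototype(members_num, members_cat):
    """Prototype of a cluster: per-attribute mean for numeric columns,
    per-attribute most frequent value (mode) for categorical columns."""
    n = len(members_num)
    means = [sum(col) / n for col in zip(*members_num)]
    modes = [Counter(col).most_common(1)[0][0] for col in zip(*members_cat)]
    return means, modes


def k_prototypes(num, cat, init_idx, gamma=1.0, max_iter=20):
    """Batch variant (an assumption): assign every object to its nearest
    prototype, recompute prototypes, repeat until labels stop changing."""
    protos = [(list(num[i]), list(cat[i])) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(
                range(len(protos)),
                key=lambda j: mixed_dissimilarity(
                    num[i], cat[i], protos[j][0], protos[j][1], gamma
                ),
            )
            for i in range(len(num))
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(protos)):
            idx = [i for i, lab in enumerate(labels) if lab == j]
            if idx:  # keep the old prototype if a cluster becomes empty
                protos[j] = update_prototype(
                    [num[i] for i in idx], [cat[i] for i in idx]
                )
    return labels, protos
```

Calling `k_prototypes(num, cat, init_idx=[0, 2])` with `num` as a list of numeric rows and `cat` as a list of categorical rows returns one cluster label per object and the final prototypes. With no categorical attributes (empty `cat` rows) the mismatch term vanishes and the loop reduces to the usual k-means assignment and mean-update steps, which matches the abstract's remark about numeric data.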

A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
This paper presents an algorithm, called k-modes, that extends the k-means paradigm to categorical domains. It introduces new dissimilarity measures to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process so as to minimise the clustering cost function.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • J. Huang
  • Computer Science
    Data Mining and Knowledge Discovery
  • 2004
Two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values are presented and are shown to be efficient when clustering large data sets, which is critical to data mining applications.
Design and analysis of clustering algorithms for numerical, categorical and mixed data
The purpose of this research is to design and analyse clustering algorithms for numerical, categorical and mixed data sets, and a main part of this thesis is devoted to normalisation.
An iterative initial-points refinement algorithm for categorical data clustering
A New Clustering Algorithm of Hybrid Data According to Weights of Attributes
This paper introduces an algorithm which has been improved for the clustering of large hybrid data in an effective way that also includes the weights of attributes, mainly based on the K-Prototypes algorithm.
Modified K-Means Algorithm for Effective Clustering of Categorical Data Sets
Traditional k-means algorithm is well known for its clustering ability and efficiency on large amount of data sets. But this method is well suited for numeric values only and cannot be effectively
An alternative extension of the k-means algorithm for clustering categorical data
This paper shows how to apply the notion of “cluster centers” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem.
Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes
An improved k-prototypes algorithm is proposed in this paper, which employs a new dissimilarity measure for incomplete data set with mixed numeric and categorical attributes and a new approach to select k objects as the initial prototypes based on the nearest neighbors.
Integrated Framework Using Frequent Pattern for Clustering Numeric and Nominal Data Sets
An integrated framework using frequent pattern analysis, the frequent pattern-based framework for mixed data clustering (FPMC) algorithm, is proposed to cluster mixed data in a competent way by performing a one-time clustering along with attribute reduction.


Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy
  • R. Michalski, R. Stepp
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1983
A method for automated construction of classifications called conceptual clustering is described and compared to methods used in numerical taxonomy, in which descriptive concepts are conjunctive statements involving relations on selected object attributes and optimized according to an assumed global criterion of clustering quality.
Some methods for classification and analysis of multivariate observations
The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give
c-means clustering with the l1 and l∞ norms
This method broadens the applications horizon of the FCM family by enabling users to match discontinuous multidimensional numerical data structures with similarity measures that have nonhyperelliptical topologies.
C4.5: Programs for Machine Learning
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
A non-greedy approach to tree-structured clustering
A deterministic annealing approach to clustering
Programs for Machine Learning
In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
A General Coefficient of Similarity and Some of Its Properties
A general coefficient measuring the similarity between two sampling units is defined. The matrix of similarities between all pairs of sample units is shown to be positive semidefinite (except
Discrimination and Classification
Presents different approaches to discrimination and classification problems from a statistical perspective. Provides computer projects concentrating on the most widely used and important algorithms,
Genetic Algorithms in Search Optimization and Machine Learning
This book brings together the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields.