HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis

Cameron Shand, Richard W. Allmendinger, Julia Handl, Andrew M. Webb, John A. Keane
Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a… 

Evolving controllably difficult datasets for clustering
HAWKS, a new data generator that uses an evolutionary algorithm to evolve the cluster structure of a synthetic dataset, is introduced, and it is demonstrated how such an approach can be used to produce datasets of a pre-specified difficulty.
Clustering - What Both Theoreticians and Practitioners Are Doing Wrong
The severity of this problem is argued, some recent proposals aiming to address this crucial lacuna are described, and it is claimed that the most significant challenge for clustering is model selection.
Benchmarking in cluster analysis: A white paper
To achieve scientific progress in terms of building a cumulative body of knowledge, careful attention to benchmarking is of the utmost importance. This means that proposals of new methods of data
Clustering algorithms: A comparative approach
A systematic comparison of 9 well-known clustering methods available in the R language, assuming normally distributed data, revealed that the default configuration of the adopted implementations was not always accurate, and a simple approach based on random selection of parameter values proved to be a good alternative to improve the performance.
Clustering: Science or Art?
It is argued that it will be useful to build a "taxonomy of clustering problems" to identify clustering applications which can be treated in a unified way and that such an effort will be more fruitful than attempting the impossible--developing "optimal" domain-independent clustering algorithms or even classifying clustering algorithms in terms of how they work.
An Impossibility Theorem for Clustering
A formal perspective on the difficulty in finding a unified framework for reasoning about clustering at a technical level is suggested, in the form of an impossibility theorem: for a set of three simple properties, it is shown that there is no clustering function satisfying all three.
An Analysis of Meta-learning Techniques for Ranking Clustering Algorithms Applied to Artificial Data
This work investigates the use of different components in an unsupervised meta-learning framework and shows that the system, using MLP and SVR meta-learners, was able to successfully associate the proposed sets of dataset characteristics to the performance of the new candidate algorithms.
Cross-disciplinary perspectives on meta-learning for algorithm selection
The generalization of meta-learning concepts to algorithms focused on tasks including sorting, forecasting, constraint satisfaction, and optimization, and the extension of these ideas to bioinformatics, cryptography, and other fields are discussed.