Projective clustering of high dimensional data

Abstract

Clustering of high-dimensional data can be problematic, because the usual notions of distance or similarity break down in high dimensions. More specifically, it can be shown that, as the number of dimensions increases, the distance to the nearest point approaches the distance to the farthest one. Two approaches are common for dealing with this problem. The idea behind the first approach is to project all the points onto a lower-dimensional subspace and then use a standard clustering algorithm on the low-dimensional representation. However, if different subsets of the points cluster well in different subspaces of the original feature space, then a global dimensionality reduction will fail. In the second approach, projection and clustering are performed simultaneously, allowing each cluster to have a different subspace associated with it. These projective clustering algorithms compute pairs (Ci, Di), consisting of the points Ci belonging to cluster i and the subspace Di in which these points have low variance. Three algorithms are presented that follow different approaches to projective clustering. The first is a partitional method that iteratively assigns points to clusters and re-estimates the cluster centroids, similar to k-means but with projection steps included in the iteration. The second is density based; it works by extending the clusters to nearby points, where proximity in high dimensions is defined based on the variance of the clusters along different axes. The last algorithm is an ensemble method. It repeatedly performs random projections, which are then clustered using the EM algorithm and combined. The partitional method optimizes a well-defined objective function, but scales poorly as the number of dimensions grows. The density-based method scales linearly in the number of dimensions, but it only finds projections onto axis-parallel subspaces and not onto ones that are arbitrarily rotated. The ensemble method can exploit the diversity of the individual solutions and produces high-quality clusters in practice, but lacks theoretical guarantees.
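
The claim that the nearest and farthest distances converge as the dimension grows can be illustrated with a small numerical experiment. The sketch below is not taken from the paper: it assumes points drawn uniformly at random from the unit hypercube and Euclidean distance, and the function name `relative_contrast` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """Return (nearest distance) / (farthest distance) from one random
    query point to n_points random points in the unit hypercube.

    Assumption for illustration only: uniform data, Euclidean distance.
    A ratio close to 1 means nearest and farthest are hard to tell apart.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in [0, 1)^n_dims
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


if __name__ == "__main__":
    # The ratio moves toward 1 as the number of dimensions increases.
    for d in (2, 10, 100, 1000):
        print(d, round(relative_contrast(1000, d), 3))
```

In low dimensions the ratio stays far below 1, so a nearest neighbor is clearly distinguishable; in high dimensions it moves toward 1, which is the effect that motivates projecting each cluster onto its own low-dimensional subspace.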

23 Figures and Tables

Cite this paper

@inproceedings{Kandylas2007ProjectiveCO, title={Projective clustering of high dimensional data}, author={Vasileios Kandylas}, year={2007} }