# Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

@article{Radovanovi2010HubsIS, title={Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data}, author={Milo{\vs} Radovanovi{\'c} and Alexandros Nanopoulos and Mirjana Ivanovi{\'c}}, journal={J. Mach. Learn. Res.}, year={2010}, volume={11}, pages={2487-2531} }

Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this…
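The k-occurrence count defined in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names `k_occurrences` and `skewness`, the Gaussian test data, and the choice of Euclidean distance, `k=5`, and 1000 points are all assumptions made for the example. Skewness of the counts is used here as one common way to quantify how lopsided the distribution is.

```python
import numpy as np

def k_occurrences(X, k):
    """Return N_k: how often each point appears in other points' k-NN lists."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances (assumed metric for this sketch).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a point is not counted as its own neighbor
    # Indices of the k nearest neighbors of every point (unordered within the k).
    knn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # Count how many k-NN lists each index occurs in.
    return np.bincount(knn.ravel(), minlength=n)

def skewness(values):
    """Standardized third moment of the values."""
    v = values - values.mean()
    return float(np.mean(v ** 3) / np.mean(v ** 2) ** 1.5)

# Hypothetical demo data: i.i.d. Gaussian points in low and high dimension.
rng = np.random.default_rng(0)
for d in (3, 100):
    X = rng.standard_normal((1000, d))
    n_k = k_occurrences(X, k=5)
    print(f"d={d}: max N_k={n_k.max()}, skewness={skewness(n_k):.2f}")
```

The counts always sum to `n * k`, since every point contributes exactly `k` neighbors; hubness shows up in how unevenly that total is spread across points.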

## 480 Citations

Local and global scaling reduce hubs in space

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2012

Two classes of methods that try to symmetrize nearest neighbor relations are discussed, along with the extent to which they can mitigate the negative effects of hubs, and a real-world application is presented in which the methods achieve significantly higher retrieval quality.

Hubness-Aware Shared Neighbor Distances for High-Dimensional k-Nearest Neighbor Classification

- Computer Science
- HAIS
- 2012

This paper proposes a new method for calculating the secondary distances which is aware of the underlying neighbor occurrence distribution and suggests that this new approach achieves consistently superior performance on all considered high-dimensional data sets.

The Hubness Phenomenon in High-Dimensional Spaces

- Computer Science
- Association for Women in Mathematics Series
- 2019

This chapter identifies new geometric relationships between hubness, data density, and data distance distribution, as well as between hubness, subspaces, and intrinsic dimensionality of data.

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces

- Computer Science, Medicine
- Knowledge and Information Systems
- 2018

A large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures is presented, and their influence on data semantics, measured via nearest neighbor classification, is investigated.

The Role of Hubness in Clustering High-Dimensional Data

- Mathematics, Computer Science
- IEEE Transactions on Knowledge and Data Engineering
- 2014

This paper shows that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest-neighbor lists of other points, can be successfully exploited in clustering, and proposes several hubness-based clustering algorithms.

Can Shared Nearest Neighbors Reduce Hubness in High-Dimensional Spaces?

- Computer Science
- 2013 IEEE 13th International Conference on Data Mining Workshops
- 2013

This study applies SNN to a larger number of high dimensional real world data sets from diverse domains and compares it to two other secondary distance approaches (local scaling and mutual proximity).

Choosing ℓp norms in high-dimensional spaces based on hub analysis

- Computer Science, Medicine
- Neurocomputing
- 2015

This work proposes an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification accuracy. The approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness.

Hubness-Based Clustering of High-Dimensional Data

- Computer Science
- 2015

This chapter reviews and refines existing work which explains the mechanisms of hubness, establishes the location of hub points near central regions of clusters in the data, and shows how hubness can negatively affect existing clustering algorithms by virtue of hub points lowering between-cluster distance.

Class imbalance and the curse of minority hubs

- Computer Science
- Knowl. Based Syst.
- 2013

It is argued that it might prove beneficial to combine the extensible hubness-aware voting frameworks with the existing class-imbalanced kNN classifiers, in order to properly handle class-imbalanced data in high-dimensional feature spaces.

## References

Nearest neighbors in high-dimensional data: the emergence and influence of hubs

- Mathematics, Computer Science
- ICML '09
- 2009

This paper studies a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set, and shows that, as dimensionality increases, this distribution becomes considerably skewed and hub points (points with very high k-occurrences) emerge.

On the Surprising Behavior of Distance Metrics in High Dimensional Spaces

- Computer Science
- ICDT
- 2001

This paper examines the behavior of the commonly used L_k norm and shows that the problem of meaningfulness in high dimensionality is sensitive to the value of k, which means that the Manhattan distance metric is consistently preferable to the Euclidean distance metric for high-dimensional data mining applications.

Time-Series Classification in Many Intrinsic Dimensions

- Computer Science
- SDM
- 2010

A framework for categorizing time-series data sets based on measurements that reflect hubness and the diversity of class labels among nearest neighbors is formed, and the merits of the framework are demonstrated through experimental evaluation of 1-NN and k-NN classifiers, including a proposed weighting scheme that is designed to make use of hubness information.

On the existence of obstinate results in vector space models

- Computer Science
- SIGIR
- 2010

The origins of hubness are analyzed, showing it is primarily a consequence of high (intrinsic) dimensionality of data, and not a result of other factors such as sparsity and skewness of the distribution of term frequencies.

On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'

- Computer Science
- IEEE Trans. Knowl. Data Eng.
- 2001

It is shown how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets.

How does high dimensionality affect collaborative filtering?

- Computer Science
- RecSys '09
- 2009

This paper addresses two phenomena that emerge when CF algorithms perform NN search in the high-dimensional spaces typical of CF applications: similarity concentration and the appearance of hubs.

When is 'nearest neighbour' meaningful: A converse theorem and implications

- Mathematics, Computer Science
- J. Complex.
- 2009

The converse of Beyer et al.'s result is established, which shows that the Euclidean distance will not concentrate as long as the number of 'relevant' dimensions grows no slower than the overall data dimensions.

Taming the Curse of Dimensionality in Kernels and Novelty Detection

- Computer Science
- WSC
- 2004

This paper addresses two issues involving high dimensional data and illustrates methods to overcome dimensionality problems with unsupervised learning utilizing subspace models.

When Is "Nearest Neighbor" Meaningful?

- Computer Science
- ICDT
- 1999

The effect of dimensionality on the "nearest neighbor" problem is explored, and it is shown that under a broad set of conditions, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
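The concentration effect summarized above can be observed numerically. The following is a minimal sketch under stated assumptions, not the cited paper's experiment: the helper name `relative_contrast`, the uniform data in the unit cube, the Euclidean norm, and the sample size of 500 are all choices made for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(d, n=500, seed=0):
    """(d_max - d_min) / d_min for one random query against n random points."""
    rng = np.random.default_rng(seed)
    data = rng.random((n, d))   # assumed: i.i.d. uniform points in [0, 1]^d
    query = rng.random(d)       # assumed: query drawn from the same distribution
    dist = np.linalg.norm(data - query, axis=1)
    return float((dist.max() - dist.min()) / dist.min())

for d in (2, 10, 100, 1000):
    print(f"d={d}: relative contrast = {relative_contrast(d):.3f}")
```

Under these assumptions the contrast shrinks as `d` grows, which is the sense in which the nearest and farthest distances become hard to tell apart.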

Distance Metric Learning for Large Margin Nearest Neighbor Classification

- Computer Science, Mathematics
- NIPS
- 2005

This paper shows how to learn a Mahalanobis distance metric for kNN classification from labeled examples in a globally integrated manner and finds that metrics trained in this way lead to significant improvements in kNN classification.