High-dimensional data arise naturally in many domains and regularly pose a great challenge to traditional data-mining techniques, in terms of both effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty of distinguishing between the distances of different data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by restricting attention to a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in the k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by proposing several hubness-based clustering algorithms and testing them on high-dimensional data. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.
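The hubness phenomenon the abstract refers to can be made concrete by computing each point's k-occurrence score N_k(x): the number of other points whose k-nearest-neighbor lists contain x. The following is a minimal sketch (not the authors' implementation) on synthetic Gaussian data; the function name `k_occurrence` and all parameter choices are illustrative assumptions.

```python
# Sketch: k-occurrence (hubness) scores on synthetic high-dimensional data.
# The function name and data generation are illustrative, not from the paper.
import numpy as np

def k_occurrence(X, k):
    """Count how often each point appears in the k-nearest-neighbor
    lists of the other points (its N_k score)."""
    n = len(X)
    # Pairwise squared Euclidean distances (fine for small n).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d[i])[:k]:  # k nearest neighbors of point i
            counts[j] += 1
    return counts

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))  # 200 points in 50 dimensions
scores = k_occurrence(X, k=10)
# The mean N_k score is always exactly k (n*k neighbor slots over n points),
# but in high dimensions the distribution is skewed: a few hubs score far
# above the mean while many anti-hubs score near zero.
print(scores.mean(), scores.max())
```

A skewed N_k distribution (maximum well above the mean of k) is the signal that the proposed algorithms exploit when selecting hubs as cluster prototypes.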