Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

Authors:
Miloš Radovanović; Alexandros Nanopoulos; Mirjana Ivanović
Venue:
The Journal of Machine Learning Research
Year:
2010

Abstract

Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
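
The k-occurrence count defined in the abstract can be illustrated with a short sketch. This is not the authors' code: it is a minimal brute-force version that assumes Euclidean distance and i.i.d. uniform synthetic data, and the names `k_occurrences`, `skewness`, and `sample_uniform` are hypothetical helpers introduced only for illustration. It counts, for each point, how many other points include it among their k nearest neighbors, then summarizes the asymmetry of that distribution with the standardized third moment.

```python
import random


def k_occurrences(points, k):
    """For each point, count how many other points have it among their k nearest neighbors.

    Assumes Euclidean distance (compared via squared distances) and brute-force search.
    """
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared distances from point i to every other point, paired with the index.
        dists = []
        for j in range(n):
            if j != i:
                d = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                dists.append((d, j))
        dists.sort()
        # Each of the k closest points gains one occurrence.
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Standardized third moment; positive values indicate a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def sample_uniform(n, dim, rng):
    """Draw n points uniformly from the unit hypercube (an assumed synthetic setting)."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for dim in (3, 50):
        data = sample_uniform(200, dim, rng)
        n_k = k_occurrences(data, k=5)
        print(f"dim={dim}: max N_k={max(n_k)}, skewness={skewness(n_k):.2f}")
```

Every point contributes exactly k neighbor votes, so the counts always sum to n·k and their mean is k; what changes with the data is how unevenly those votes are spread. If the abstract's claim holds in this setting, the higher-dimensional run should tend to show a larger maximum count and a more positive skewness, though a single small sample like this one is only indicative.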