Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase "curse of dimensionality." Most distance-based methods become impaired by the distance concentration of many widely used metrics in high-dimensional spaces. One recently proposed approach suggests that using secondary distances, based on the number of shared k-nearest neighbors between points, might partly resolve the concentration issue and thereby improve overall performance. Nevertheless, the curse of dimensionality also affects k-nearest-neighbor inference in severely negative ways, one such consequence being known as hubness. The impact of hubness on forming shared-neighbor distances has not been discussed before, and it is what we focus on in this paper. Furthermore, we propose a new method for calculating the secondary distances that is aware of the underlying neighbor-occurrence distribution. Our experiments suggest that this new approach achieves consistently superior performance on all considered high-dimensional data sets. An additional benefit is that it requires essentially no extra computation compared to the original methods.
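To make the idea of shared-neighbor secondary distances concrete, here is a minimal illustrative sketch (not the paper's proposed hubness-aware method): a brute-force secondary distance defined as one minus the fraction of shared k-nearest neighbors, plus the k-occurrence counts whose skew signals hubness. The function names (`knn_lists`, `snn_distance`) and parameter choices are ours, for illustration only.

```python
# Illustrative sketch of shared-nearest-neighbor (SNN) secondary distances.
# This is NOT the paper's hubness-aware variant, just the baseline idea:
# secondary distance = 1 - (shared k-NN count) / k.
import numpy as np

def knn_lists(X, k):
    # Brute-force k-nearest-neighbor lists under Euclidean (primary) distance.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point as its own neighbor
    return np.argsort(d, axis=1)[:, :k]  # indices of the k closest points

def snn_distance(X, k=5):
    # Secondary distance matrix from overlap of k-NN lists.
    nn = knn_lists(X, k)
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(nn[i]) & set(nn[j]))
            D[i, j] = D[j, i] = 1.0 - shared / k
    return D

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))            # 20 points in a 50-dimensional space
D = snn_distance(X, k=5)

# k-occurrence counts: how often each point appears in others' k-NN lists.
# A strongly skewed distribution of these counts is the hubness phenomenon.
occ = np.bincount(knn_lists(X, 5).ravel(), minlength=len(X))
```

Note that `occ` sums to n*k by construction; under hubness, a few "hub" points absorb a disproportionate share of these occurrences, which is exactly what distorts the shared-neighbor overlap counts above.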