On the existence of obstinate results in vector space models

Authors:
Miloš Radovanović;Alexandros Nanopoulos;Mirjana Ivanović
Affiliations:
University of Novi Sad, Novi Sad, Serbia;University of Hildesheim, Hildesheim, Germany;University of Novi Sad, Novi Sad, Serbia
Venue:
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Year:
2010

Abstract

The vector space model (VSM) is a popular and widely applied model in information retrieval (IR). VSM creates vector spaces whose dimensionality is usually high (e.g., tens of thousands of terms). This may cause various problems, such as susceptibility to noise and difficulty in capturing the underlying semantic structure, which are commonly recognized as different aspects of the "curse of dimensionality." In this paper, we investigate a novel aspect of the dimensionality curse, referred to as hubness, which manifests as the tendency of some documents (called hubs) to be included in unexpectedly many search result lists. Hubness may impact VSM considerably, since hubs can become obstinate results that are irrelevant to a large number of queries, thus harming the performance of an IR system and the experience of its users. We analyze the origins of hubness, showing that it is primarily a consequence of the high (intrinsic) dimensionality of data, and not a result of other factors such as sparsity and skewness of the distribution of term frequencies. We describe the mechanisms through which hubness emerges by exploring the behavior of similarity measures in high-dimensional vector spaces. Our consideration begins with the classical VSM (tf-idf term weighting and cosine similarity), but the conclusions generalize to more advanced variations, such as Okapi BM25. Moreover, we explain why hubness may not be easily mitigated by dimensionality reduction, and propose a similarity adjustment scheme that takes into account the existence of hubs. Experimental results over real data indicate that significant improvement can be obtained through consideration of hubness.
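As an illustration of what "included in unexpectedly many search result lists" means, the following minimal sketch counts, for each document, how often it appears among the k most cosine-similar results when every other document is used as a query. It is not the authors' code: the function names (`cosine`, `k_occurrences`, `skewness`) and the toy vectors are assumptions made for this example, and the input vectors are assumed to be already term-weighted (e.g., by tf-idf). A strongly right-skewed distribution of these counts is one common way to describe hubness.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_occurrences(vectors, k):
    """For each vector, count how many other vectors rank it in their top-k
    result list under cosine similarity (each vector acts as a query once)."""
    n = len(vectors)
    counts = [0] * n
    for q in range(n):
        # Rank all documents except the query itself, most similar first.
        others = [i for i in range(n) if i != q]
        others.sort(key=lambda i: cosine(vectors[q], vectors[i]), reverse=True)
        for i in others[:k]:
            counts[i] += 1
    return counts


def skewness(counts):
    """Standardized third moment of the counts; larger positive values mean
    a few documents collect a disproportionate share of the result lists."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var == 0:
        return 0.0
    return sum((c - mean) ** 3 for c in counts) / n / var ** 1.5


# Hypothetical toy collection: three documents over two terms.
docs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(k_occurrences(docs, k=1))  # → [1, 2, 0]
```

In the toy collection the second document is the top result for both other documents, while the third is never retrieved. On a real collection one would compute the counts for a fixed k over all documents and inspect their skewness; the brute-force ranking here is quadratic in the number of documents and is meant only to make the definition concrete.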