The vector space model (VSM) is a popular and widely applied model in information retrieval (IR). VSM creates vector spaces whose dimensionality is usually high (e.g., tens of thousands of terms). This may cause various problems, such as susceptibility to noise and difficulty in capturing the underlying semantic structure, which are commonly recognized as different aspects of the "curse of dimensionality." In this paper, we investigate a novel aspect of the dimensionality curse, referred to as hubness, which manifests as the tendency of some documents (called hubs) to be included in unexpectedly many search result lists. Hubness may impact VSM considerably since hubs can become obstinate results, irrelevant to a large number of queries, thus harming the performance of an IR system and the experience of its users. We analyze the origins of hubness, showing that it is primarily a consequence of the high (intrinsic) dimensionality of data, and not a result of other factors such as sparsity and skewness of the distribution of term frequencies. We describe the mechanisms through which hubness emerges by exploring the behavior of similarity measures in high-dimensional vector spaces. Our analysis begins with the classical VSM (tf-idf term weighting and cosine similarity), but the conclusions generalize to more advanced variations, such as Okapi BM25. Moreover, we explain why hubness may not be easily mitigated by dimensionality reduction, and propose a similarity adjustment scheme that takes into account the existence of hubs. Experimental results over real data indicate that significant improvement can be obtained through consideration of hubness.
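To make the notion of "being included in many search result lists" concrete, the following is a minimal sketch, not the paper's own procedure. It assumes a plain tf-idf weighting (raw term frequency times log(n/df)), cosine similarity, and a setup in which every document in turn serves as the query. The function names (`tfidf_vectors`, `cosine`, `k_occurrences`), the toy corpus, and the choice k = 2 are all hypothetical and chosen only for illustration. The corpus is far too small and low-dimensional to exhibit hubness; the sketch only shows the quantity one would inspect on real data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: weight = raw term frequency * log(n / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def k_occurrences(vectors, k):
    """Count, for each document, how many top-k result lists it appears in.

    Each document is used once as the query; the query itself is excluded
    from its own result list. Ties are broken by document index.
    """
    counts = [0] * len(vectors)
    for q, query_vec in enumerate(vectors):
        ranked = sorted(
            ((cosine(query_vec, doc_vec), d)
             for d, doc_vec in enumerate(vectors) if d != q),
            key=lambda pair: (-pair[0], pair[1]),
        )
        for _, d in ranked[:k]:
            counts[d] += 1
    return counts


# Hypothetical toy corpus, for illustration only.
docs = [text.split() for text in [
    "vector space model retrieval",
    "term weighting for retrieval",
    "cosine similarity in vector space",
    "nearest neighbors in high dimensional data",
    "dimensionality reduction for text data",
]]

counts = k_occurrences(tfidf_vectors(docs), k=2)
print(counts)  # one count per document; the counts sum to k * len(docs)
```

On a real collection, a strongly right-skewed distribution of these counts (a few documents with very large values) would be the signature of hubs described above.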