How does high dimensionality affect collaborative filtering?

Authors:
Alexandros Nanopoulos;Miloš Radovanović;Mirjana Ivanović
Affiliations:
University of Hildesheim, Hildesheim, Germany;University of Novi Sad, Novi Sad, Serbia;University of Novi Sad, Novi Sad, Serbia
Venue:
Proceedings of the third ACM conference on Recommender systems
Year:
2009

Abstract

A crucial operation in memory-based collaborative filtering (CF) is determining the nearest neighbors (NNs) of users or items. This paper addresses two phenomena that emerge when CF algorithms perform NN search in the high-dimensional spaces typical of CF applications. The first is similarity concentration; the second is the appearance of hubs (i.e., points that appear in the $k$-NN lists of many other points). Through theoretical analysis and experimental evaluation, we show that these phenomena are inherent properties of high-dimensional space, unrelated to other data properties such as sparsity, and that they can affect CF algorithms by calling into question the meaning and representativeness of the discovered NNs. Moreover, we show that the phenomena are not easily mitigated by dimensionality reduction. By studying these phenomena, we aim to provide a better understanding of the limitations of memory-based CF and to motivate the development of new algorithms that overcome them.
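
The two phenomena named in the abstract can be illustrated with a small synthetic experiment. The sketch below is not the authors' experimental setup: it assumes dense, i.i.d. uniform random vectors in place of real rating data, uses cosine similarity, and all names (`neighbor_statistics`, `spread`, `n_k`, `max_n_k`, `skewness`) are hypothetical. It reports the relative spread of pairwise similarities (a rough proxy for similarity concentration) and the distribution of $N_k$, the number of $k$-NN lists in which each point appears (a rough proxy for hubness).

```python
import numpy as np


def neighbor_statistics(n_points, n_dims, k, seed=0):
    """Summarize similarity concentration and hubness on synthetic data.

    Assumption: points are dense i.i.d. uniform vectors in [0, 1]^n_dims,
    which only stands in for the rating vectors of a real CF data set.
    """
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))

    # Pairwise cosine similarities between all points.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T

    # Relative spread (std / mean) of the off-diagonal similarities.
    # Smaller values mean the similarities are more concentrated.
    off_diag = S[~np.eye(n_points, dtype=bool)]
    spread = off_diag.std() / off_diag.mean()

    # k nearest neighbors of each point by similarity, excluding itself.
    np.fill_diagonal(S, -np.inf)
    knn = np.argpartition(-S, k, axis=1)[:, :k]

    # N_k(x): how many k-NN lists the point x appears in.
    n_k = np.bincount(knn.ravel(), minlength=n_points)

    # Skewness of the N_k distribution; a long right tail indicates hubs.
    centered = n_k - n_k.mean()
    skewness = (centered ** 3).mean() / n_k.std() ** 3

    return {
        "spread": spread,
        "n_k": n_k,
        "max_n_k": int(n_k.max()),
        "skewness": skewness,
    }


if __name__ == "__main__":
    for d in (2, 20, 200):
        stats = neighbor_statistics(n_points=500, n_dims=d, k=5)
        print(d, round(stats["spread"], 4), stats["max_n_k"],
              round(stats["skewness"], 3))
```

Running the script for increasing `n_dims` lets one compare how the spread of similarities and the largest $N_k$ change with dimensionality. Because the data here are dense, any effect observed cannot be attributed to sparsity, which is in line with the abstract's claim, though this toy setting is no substitute for the paper's analysis.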