Scaling up top-K cosine similarity search

  • Authors:
  • Shiwei Zhu; Junjie Wu; Hui Xiong; Guoping Xia

  • Affiliations:
  • Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China; Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China; Management Science and Information Systems Department, Rutgers Business School - Newark and New Brunswick, Rutgers University, Newark, NJ 07102, USA; Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2011

Abstract

Recent years have witnessed increasing interest in computing cosine similarity across many application domains. Most previous studies require the user to specify a minimum similarity threshold before the computation can be performed. In practice, however, it is usually difficult for users to choose an appropriate threshold. In this paper, we instead propose searching for the top-K most strongly correlated pairs of objects as measured by cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy to develop the TOP-DATA algorithm. We further observe that a diagonal traversal strategy usually incurs higher I/O costs. We therefore develop a max-first traversal strategy and propose the TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA avoids computing the similarity of false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiency of both TOP-DATA and TOP-MATA. We also show that TOP-MATA is particularly scalable on large data sets with a large number of items.
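
To make the idea of threshold-free top-K search with an upper bound concrete, here is a minimal Python sketch. It is not TOP-DATA or TOP-MATA: it uses a plain row-by-row scan rather than the diagonal or max-first traversals named in the abstract. It also assumes binary transaction data, where the cosine of items a and b is supp(ab) / sqrt(supp(a) * supp(b)), and assumes the bound sqrt(supp(b) / supp(a)) for supp(a) >= supp(b), which follows from supp(ab) <= supp(b). The abstract gives neither formula, and the function name `top_k_cosine_pairs` is hypothetical.

```python
import heapq
from math import sqrt


def top_k_cosine_pairs(transactions, k):
    """Return up to k item pairs with the highest cosine, as (cosine, a, b).

    Assumption: binary data, so cosine(a, b) = supp(ab) / sqrt(supp(a) * supp(b)).
    """
    if k <= 0:
        return []

    # Map each item to the set of transaction ids that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    # Sort items by support, descending (ties broken by item name).
    ranked = sorted(tidsets, key=lambda it: (-len(tidsets[it]), it))

    heap = []  # min-heap holding the best k pairs seen so far
    for i, a in enumerate(ranked):
        supp_a = len(tidsets[a])
        for b in ranked[i + 1:]:
            supp_b = len(tidsets[b])
            # Assumed upper bound: supp(ab) <= supp(b) <= supp(a).
            upper = sqrt(supp_b / supp_a)
            # Within a row the bound only shrinks as supp(b) shrinks, so once
            # it cannot beat the current k-th best score, skip the rest.
            if len(heap) == k and upper <= heap[0][0]:
                break
            cosine = len(tidsets[a] & tidsets[b]) / sqrt(supp_a * supp_b)
            if len(heap) < k:
                heapq.heappush(heap, (cosine, a, b))
            elif cosine > heap[0][0]:
                heapq.heapreplace(heap, (cosine, a, b))

    return sorted(heap, reverse=True)


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"],
            ["b", "c"], ["a", "b", "d"], ["d"]]
    print(top_k_cosine_pairs(data, 1))
```

On the toy data in the `__main__` block, items a and b each occur in four transactions and co-occur in three, so the top pair is (a, b) with cosine 3 / sqrt(4 * 4) = 0.75. The early `break` relies only on the bound decreasing along a row; pairs tied with the current k-th best score may be skipped, which leaves the returned scores unchanged.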