Fast Matching for All Pairs Similarity Search

Authors:
Amit Awekar;Nagiza F. Samatova
Venue:
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2009

Abstract

All pairs similarity search is the problem of finding all pairs of records whose similarity score is above a specified threshold. Many real-world systems, such as search engines, online social networks, and digital libraries, frequently have to solve this problem for datasets with millions of records in a high-dimensional space, where the records are often sparse. The challenge is to design algorithms with feasible time requirements. To meet this challenge, algorithms have been proposed based on the inverted index, which maps each dimension to a list of records with a non-zero projection along that dimension. Common to these algorithms is a three-phase framework of data preprocessing, pairs matching, and indexing. Matching is the most time-consuming phase. Within this framework, we propose a fast matching technique that uses the sparse nature of real-world data to reduce the size of the search space through a systematic set of tighter filtering conditions and heuristic optimizations. We integrate our technique with the fastest-to-date algorithm in the field and achieve up to 6.5X speed-up on three large real-world datasets.
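To make the three-phase framework concrete, the following is a minimal baseline sketch of inverted-index matching. It is not the paper's fast matching technique and applies none of its filtering conditions or heuristics. Several details are assumptions for illustration only: records are sparse dictionaries mapping a dimension to a weight, similarity is cosine, a pair is reported when its score is at or above the threshold, and the names `normalize` and `all_pairs` are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Preprocessing: scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {dim: w / norm for dim, w in vec.items()}


def all_pairs(records, threshold):
    """Return (id_a, id_b, score) for record pairs with cosine similarity >= threshold.

    Baseline only: every record sharing a dimension with the current record
    becomes a candidate, with no pruning of the search space.
    """
    index = defaultdict(list)  # inverted index: dimension -> [(record_id, weight)]
    pairs = []
    for rid, raw in enumerate(records):
        vec = normalize(raw)
        # Matching: accumulate dot products against records indexed so far.
        scores = defaultdict(float)
        for dim, w in vec.items():
            for other, other_w in index[dim]:
                scores[other] += w * other_w
        for other, score in scores.items():
            if score >= threshold:  # assumed convention: at or above the threshold
                pairs.append((other, rid, score))
        # Indexing: add the current record so that later records can match it.
        for dim, w in vec.items():
            index[dim].append((rid, w))
    return pairs


records = [{"a": 2.0}, {"a": 3.0, "b": 4.0}, {"c": 1.0}]
result = all_pairs(records, threshold=0.5)
```

In this sketch the inner loops over `index[dim]` are the matching phase, which is where the abstract says most of the time is spent. Because each record is matched only against records already in the index, every pair is scored once.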