Incremental all pairs similarity search for varying similarity thresholds

Authors:
Amit Awekar; Nagiza F. Samatova; Paul Breimyer
Affiliations:
North Carolina State University, Raleigh, NC and Oak Ridge National Laboratory, Oak Ridge, TN; North Carolina State University, Raleigh, NC and Oak Ridge National Laboratory, Oak Ridge, TN; North Carolina State University, Raleigh, NC and Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
Proceedings of the 3rd Workshop on Social Network Mining and Analysis
Year:
2009

Abstract

All Pairs Similarity Search (APSS) is a ubiquitous problem in many data mining applications and involves finding all pairs of records with similarity scores above a specified threshold. In this paper, we introduce the problem of Incremental All Pairs Similarity Search (IAPSS), where APSS is performed multiple times over the same dataset by varying the similarity threshold. To the best of our knowledge, this is the first work that addresses the IAPSS problem. All existing solutions for APSS perform redundant computations by invoking APSS independently for each threshold value. In contrast, our solution to the IAPSS problem avoids redundant computations by storing the history of previous APSS invocations and using index splitting. While offering obvious benefits, the computation- and I/O-intensive nature of the IAPSS solution raises two key research challenges: (1) to develop efficient I/O techniques to manage computation history and (2) to efficiently identify and prune redundant computations. We address these challenges through the proposed (a) history binning technique, which clusters record pairs based on similarity values and performs I/O during the similarity computation, and (b) splitting of the inverted index, which maps each dimension to a list of records that have a non-zero projection along that dimension. We evaluate the effectiveness of our techniques by demonstrating speed-ups on the order of 2X to over 10^5X over the state-of-the-art APSS algorithm for four real-world large-scale datasets.
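
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It builds an inverted index from each dimension to the records with a non-zero weight along it, scores every co-occurring pair once, and stores the scores in similarity bins so that later threshold queries read only the relevant bins instead of rescanning the data. The names `build_inverted_index`, `all_pair_scores`, and `ThresholdHistory`, the in-memory bins, and the choice of ten equal-width bins are all assumptions made for illustration; the sketch also assumes unit-normalized records with non-negative weights, so that the dot product is a cosine similarity in [0, 1]. It omits the threshold-based pruning, disk I/O, and index splitting that the paper describes.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Map each dimension to the (record_id, weight) entries that are non-zero."""
    index = defaultdict(list)
    for rid, vec in enumerate(records):
        for dim, weight in vec.items():
            if weight != 0:
                index[dim].append((rid, weight))
    return index


def all_pair_scores(records):
    """Accumulate dot products for every pair sharing at least one dimension."""
    scores = defaultdict(float)
    for postings in build_inverted_index(records).values():
        # Postings are in increasing record-id order, so a < b for every pair.
        for (a, wa), (b, wb) in combinations(postings, 2):
            scores[(a, b)] += wa * wb
    return scores


class ThresholdHistory:
    """Hypothetical history store: pairs grouped into equal-width similarity bins."""

    def __init__(self, records, num_bins=10):
        self.num_bins = num_bins
        self.bins = [[] for _ in range(num_bins)]
        for pair, score in all_pair_scores(records).items():
            self.bins[self._bin(score)].append((pair, score))

    def _bin(self, value):
        # Clamp so that a similarity of exactly 1.0 lands in the last bin.
        return min(int(value * self.num_bins), self.num_bins - 1)

    def query(self, threshold):
        """Return all stored pairs with similarity >= threshold."""
        first = self._bin(threshold)
        # Only the boundary bin needs filtering; higher bins qualify entirely.
        result = [(p, s) for p, s in self.bins[first] if s >= threshold]
        for higher in self.bins[first + 1:]:
            result.extend(higher)
        return result


# Three unit-length records over dimensions "a" and "b" (illustrative data).
records = [{"a": 1.0}, {"a": 0.6, "b": 0.8}, {"b": 1.0}]
history = ThresholdHistory(records)
high = history.query(0.75)  # stricter threshold
low = history.query(0.5)    # relaxed threshold reuses the same stored bins
```

Records 0 and 2 share no dimension, so the inverted index never generates that pair; this is the sense in which the index avoids comparing records that cannot be similar.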