Optimizing parallel algorithms for all pairs similarity search

Authors:
Maha Ahmed Alabduljalil;Xun Tang;Tao Yang
Affiliations:
University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA;University of California, Santa Barbara, Santa Barbara, CA, USA
Venue:
Proceedings of the sixth ACM international conference on Web search and data mining
Year:
2013

Citing 20
Cited 2

Building a scalable and accurate copy detection mechanism

Proceedings of the first ACM international conference on Digital libraries
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Modern Information Retrieval

Modern Information Retrieval
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions

IEEE Transactions on Knowledge and Data Engineering
Inverted files for text search engines

ACM Computing Surveys (CSUR)
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Detectives: detecting coalition hit inflation attacks in advertising networks streams

Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Opinion spam and analysis

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Fast Matching for All Pairs Similarity Search

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Document Similarity Self-Join with MapReduce

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web

Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Scalable all-pairs similarity search in metric spaces

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

All pairs similarity search is used in many web search and data mining applications. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite its execution. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-step approach called Partition-based Similarity Search (PSS) which incorporates several optimization techniques. First, PSS uses a static partitioning algorithm that places dissimilar vectors into different groups and balance the comparison workload with a circular assignment. Second, PSS executes comparison tasks in parallel, each using a hybrid data structure that combines the advantages of forward and inverted indexing. Our evaluation results show that the proposed approach leads to an early elimination of unnecessary I/O and data communication while sustaining parallel efficiency. As a result, it improves performance by an order of magnitude when dealing with large datasets.