All pairs similarity search is used in many web search and data mining applications. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to speed it up. However, shuffling intermediate results can incur significant communication overhead as the data scales up. This paper studies a scalable two-step approach called Partition-based Similarity Search (PSS), which incorporates several optimization techniques. First, PSS uses a static partitioning algorithm that places dissimilar vectors into different groups and balances the comparison workload with a circular assignment. Second, PSS executes comparison tasks in parallel, each using a hybrid data structure that combines the advantages of forward and inverted indexing. Our evaluation shows that the proposed approach eliminates unnecessary I/O and data communication early while sustaining parallel efficiency. As a result, it improves performance by an order of magnitude on large datasets.
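The abstract does not spell out how the circular assignment works, so the following Python sketch is only one plausible reading, not the paper's algorithm. It assumes the p partitions are arranged on a ring and that partition i takes responsibility for comparing itself against the next floor(p/2) partitions, so every pair of partitions is compared once and the per-partition task counts differ by at most one. The names `circular_assignment` and `covers_all_pairs` are hypothetical, and the dissimilarity-based skipping of partition pairs is left out.

```python
from itertools import combinations


def circular_assignment(p):
    """Assign cross-partition comparison tasks on a ring (assumed scheme).

    Returns a dict mapping each partition i to the list of partitions
    that i is responsible for comparing against.
    """
    tasks = {i: [] for i in range(p)}
    half = p // 2
    for i in range(p):
        for step in range(1, half + 1):
            j = (i + step) % p
            # With an even p, the pair at ring distance p/2 is reached from
            # both ends; keep it only on the lower-numbered partition.
            if p % 2 == 0 and step == half and i > j:
                continue
            tasks[i].append(j)
    return tasks


def covers_all_pairs(p):
    """Check that every unordered partition pair is assigned exactly once."""
    tasks = circular_assignment(p)
    assigned = [frozenset((i, j)) for i, js in tasks.items() for j in js]
    expected = {frozenset(pair) for pair in combinations(range(p), 2)}
    return len(assigned) == len(expected) and set(assigned) == expected


if __name__ == "__main__":
    for owner, others in circular_assignment(5).items():
        print(owner, "compares with", others)
```

Under this assumption, an odd p gives every partition exactly (p - 1) / 2 comparison tasks, while an even p gives each partition either p / 2 or p / 2 - 1 tasks. Comparisons among vectors inside a single partition would be handled separately.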