Achieving both high precision and high recall in near-duplicate detection

  • Authors:
  • Lian'en Huang; Lei Wang; Xiaoming Li

  • Affiliations:
  • Peking University, Beijing, China; Shenzhen Graduate School of Peking University, Shenzhen, China; Peking University, Beijing, China

  • Venue:
  • Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08)
  • Year:
  • 2008

Abstract

To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state of the art. Nevertheless, we see two aspects of these approaches that can be improved. First, a high score under these algorithms' similarity measures implies a high probability that two documents are similar, which is not the same as the documents actually being highly similar; yet how similar two documents are is what we really need to know. Second, fingerprint paradigms must trade off hash-code length against hash-code multiplicity, which makes it hard to maintain a satisfactory recall level while improving precision. In this paper our contributions are twofold. First, we propose a framework for using the longest common subsequence (LCS) as a similarity measure within reasonable computing time, which yields both high precision and high recall. Second, we present an algorithm that extracts a trustable partition from the LCS to reduce the negative impact of templates used in web page design. A comprehensive experiment was conducted to evaluate our method in terms of effectiveness, efficiency, and quality of results. Specifically, the method successfully partitioned a set of 430 million web pages into 68 million subsets of similar pages, which demonstrates its effectiveness. For quality, we compared our method with simhash and a Cosine-based method through a sampling process (Cosine is compared with LCS as an alternative similarity measure). The results show that our algorithm reached an overall precision of 0.95, while simhash reached 0.71 and Cosine 0.82. At the same time, our method obtained 1.86 times the recall of simhash and 1.56 times the recall of Cosine. A comparison experiment was also conducted for documents within the same web sites; there, our algorithm, simhash, and Cosine found almost the same number of true positives at precisions of 0.91, 0.50, and 0.63, respectively. In terms of efficiency, our algorithm took 118 hours to process the whole archive of 430 million topic-type pages on a cluster of six Linux boxes, while simhash and Cosine took 94 hours and 68 hours, respectively. When the word segmentation required for languages such as Chinese is taken into account, the processing time of Cosine increases accordingly; in our experiment it reached 602 hours.
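
To give a concrete sense of the LCS-based similarity idea mentioned in the abstract, the following is a minimal Python sketch: it computes the LCS length of two token sequences by standard dynamic programming and normalizes it by the longer document's length. The function names and the normalization choice are assumptions for illustration only; the paper's actual framework (including how it keeps the computation tractable at web scale and how it derives the trustable partition) is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b,
    computed with standard O(len(a) * len(b)) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


def lcs_similarity(doc_a, doc_b):
    """Similarity in [0, 1]: LCS length normalized by the longer document.
    (The normalization choice is an assumption, not taken from the paper.)"""
    tokens_a, tokens_b = doc_a.split(), doc_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    return lcs_length(tokens_a, tokens_b) / max(len(tokens_a), len(tokens_b))


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaped over a lazy dog"
    print(f"LCS similarity: {lcs_similarity(a, b):.2f}")
```

Unlike a fingerprint comparison, a score like this directly reflects how much of the two documents' text actually matches, which is the property the abstract argues for; the cost is the quadratic pairwise computation, which the paper's framework is designed to keep within reasonable time.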