Enhancing duplicate collection detection through replica boundary discovery

Authors:
Zhigang Zhang; Weijia Jia; Xiaoming Li
Affiliations:
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong; Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong; Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Venue:
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2006

Abstract

Web documents are widely replicated on the Internet, and these replicas create potential problems for Web-based information systems, which makes replica detection on the Web an indispensable task. The challenge is to find duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary, which roughly characterizes the extent of the replicas; we then propose an effective and efficient approach to discovering this boundary. The approach has two advantages: first, it dramatically reduces the number of pair-wise document similarity computations, making it much faster than traditional replicated document detection approaches; second, it identifies the boundary of the replicated collections accurately, showing to what extent two collections are replicated. We evaluated the accuracy of the approach on two sets of Web pages containing 24 million and 30 million pages, respectively.
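To make the cost of pair-wise document similarity computation concrete, the following minimal sketch shows a naive all-pairs baseline of the kind the abstract contrasts against. It is not the authors' method. It assumes word shingles and Jaccard resemblance as the similarity measure, and the function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def all_pairs(docs: dict, threshold: float = 0.5) -> list:
    """Naive baseline: compare every pair of documents, n*(n-1)/2 comparisons.

    `threshold` is an illustrative cut-off, not a value from the paper.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if resemblance(docs[x], docs[y]) >= threshold
    ]
```

Under this baseline, a set of 24 million pages would require roughly 2.9 × 10^14 pair comparisons, which illustrates why reducing the number of pair-wise computations matters at this scale.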