Enhancing duplicate collection detection through replica boundary discovery

  • Authors:
  • Zhigang Zhang;Weijia Jia;Xiaoming Li

  • Affiliations:
  • Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong;Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong;Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China

  • Venue:
  • PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Web documents are widely replicated on the Internet. These replicated documents bring potential problems to Web based information systems. So replica detection on the Web is an indispensable task. The challenge is to find these duplicated collections from a very large data set with limited hardware resources in acceptable time. In this paper, we first introduce the notion of replica boundary to roughly reflect the situation of the replicas; then we propose an effective and efficient approach to discover the boundary of the replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional replicated document detection approaches; second, it can identify the boundary of the replicated collections accurately, demonstrating to what extent two collections are replicated. On two web page sets containing 24 million and 30 million Web pages respectively, we evaluated the accuracy of the approach.