Efficient partial-duplicate detection based on sequence matching

  • Authors:
  • Qi Zhang;Yue Zhang;Haomin Yu;Xuanjing Huang

  • Affiliations:
  • School of Computer Science, Fudan University, Shanghai, China;School of Computer Science, Fudan University, Shanghai, China;School of Computer Science, Fudan University, Shanghai, China;School of Computer Science, Fudan University, Shanghai, China

  • Venue:
  • Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the ever-increasing growth of the Internet, numerous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since partial-duplicates only contain a small piece of text taken from other sources and most existing near-duplicate detection approaches focus on document level, partial duplicates can not be dealt with well. In this paper, we propose a novel algorithm to realize the partial-duplicate detection task. Besides the similarities between documents, our proposed algorithm can simultaneously locate the duplicated parts. The main idea is to divide the partial-duplicate detection task into two subtasks: sentence level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results appear to support that our proposed method is effectively and efficiently to detect both partial-duplicates on large web collections.