A sentence-based copy detection approach for web documents

  • Authors: Rajiv Yerra; Yiu-Kai Ng
  • Affiliations: Computer Science Dept., Brigham Young University, Provo, Utah (both authors)
  • Venue: FSKD'05 Proceedings of the Second International Conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
  • Year: 2005


Abstract

Web documents that are partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which takes longer to filter for unique information and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a sentence-based copy detection approach for Web documents, which determines the existence of overlapped portions of any two given Web documents and graphically displays the locations of (semantically) the same sentences detected in the documents. Two sentences are treated as either the same or different according to the degree of similarity between them, computed using either the three least-frequent 4-gram approach or the fuzzy-set information retrieval (IR) approach. Experimental results show that the fuzzy-set IR approach outperforms the three least-frequent 4-gram approach in our copy detection framework, which handles a wide range of documents in different subject areas and does not require static word lists.
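The abstract names a "three least-frequent 4-gram" sentence signature as one of the two similarity measures. The paper itself is not reproduced here, so the following is only a minimal sketch of the general idea, under assumed details: character-level 4-grams, frequencies counted over the whole document collection, and two sentences treated as the same when their three rarest 4-grams coincide. All function names are illustrative, not from the paper.

```python
from collections import Counter

def char_4grams(sentence):
    """All contiguous character 4-grams of a sentence (whitespace normalized)."""
    s = " ".join(sentence.lower().split())
    return [s[i:i + 4] for i in range(len(s) - 3)]

def least_frequent_4grams(sentence, doc_freq, k=3):
    """Signature: the k 4-grams of the sentence rarest in the collection.

    doc_freq is a Counter of 4-gram occurrences over all sentences; ties are
    broken lexicographically so the signature is deterministic.
    """
    grams = set(char_4grams(sentence))
    return set(sorted(grams, key=lambda g: (doc_freq[g], g))[:k])

def same_sentence(s1, s2, doc_freq):
    """Treat two sentences as copies if their rare-4-gram signatures match."""
    return least_frequent_4grams(s1, doc_freq) == least_frequent_4grams(s2, doc_freq)

# Build the collection-wide frequency table, then compare sentence pairs.
sentences = ["the cat sat on the mat", "a completely different line"]
freq = Counter(g for s in sentences for g in char_4grams(s))
```

Because rare 4-grams act as a compact fingerprint, exact and near-exact copies of a sentence map to the same signature, while unrelated sentences almost always differ; the fuzzy-set IR alternative mentioned in the abstract instead scores graded word-level similarity between sentences.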