Parallel and Distributed Document Overlap Detection on the Web

  • Authors:
  • Krisztián Monostori;Arkady B. Zaslavsky;Heinz Schmidt

  • Venue:
  • PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
  • Year:
  • 2000

Abstract

The proliferation of digital libraries and the availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed, or used to create plagiarised assignments and conference papers. This paper presents a new two-stage approach for identifying overlapping documents. The first stage identifies a set of candidate documents, which are then compared in the second stage using a matching engine. The matching engine's algorithm is based on suffix trees and modifies the known matching statistics algorithm. Parallel and distributed approaches are discussed for both stages, and performance results are presented.
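
To give a feel for what "matching statistics" means in the second stage, here is a minimal Python sketch. It is not the authors' modified algorithm, and it does not use a real (compressed, linear-time) suffix tree: for brevity it assumes a naive uncompressed suffix trie built from nested dictionaries, which needs quadratic space and suits only short strings. The function names `build_suffix_trie`, `matching_statistics`, and `overlap_ratio`, and the `min_len` threshold, are hypothetical and chosen for illustration.

```python
def build_suffix_trie(text):
    """Index every suffix of `text` in a nested-dict trie.

    Assumption: an uncompressed trie stands in for a suffix tree here.
    It uses quadratic space, so it is for illustration only.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def matching_statistics(trie, query):
    """For each position i in `query`, return the length of the longest
    substring starting at i that also occurs in the indexed text."""
    stats = []
    for i in range(len(query)):
        node = trie
        length = 0
        for ch in query[i:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        stats.append(length)
    return stats


def overlap_ratio(stats, min_len):
    """Fraction of query positions covered by a match of at least
    `min_len` characters (hypothetical overlap measure)."""
    covered = [False] * len(stats)
    for i, length in enumerate(stats):
        if length >= min_len:
            for j in range(i, i + length):
                covered[j] = True
    return sum(covered) / len(covered) if covered else 0.0


if __name__ == "__main__":
    trie = build_suffix_trie("abcab")          # candidate document
    stats = matching_statistics(trie, "xabcy")  # suspicious document
    print(stats)                    # [0, 3, 2, 1, 0]
    print(overlap_ratio(stats, 3))  # 0.6
```

In this toy run, the query "xabcy" shares the substring "abc" with the indexed text, so three of its five characters are covered by a match of length 3 or more. A practical matching engine would replace the trie with a linear-time suffix tree and avoid restarting the walk from the root at every position.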