Parallel and Distributed Document Overlap Detection on the Web

Authors:
Krisztián Monostori;Arkady B. Zaslavsky;Heinz Schmidt
Venue:
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Year:
2000

Abstract

The proliferation of digital libraries and the availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed, or used to create plagiarised assignments and conference papers. This paper presents a new two-stage approach for identifying overlapping documents. The first stage identifies a set of candidate documents, which are then compared in the second stage by a matching engine. The matching engine's algorithm is based on suffix trees and modifies the known matching statistics algorithm. Parallel and distributed approaches are discussed for both stages, and performance results are presented.
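
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The word-chunk candidate filter, the function names, and the parameters `k` and `min_shared` are assumptions made for illustration. The second stage computes matching statistics by naive substring search rather than with a suffix tree, and it does not include the paper's modification of that algorithm.

```python
def word_chunks(text, k=3):
    """Return the set of all runs of k consecutive words (hypothetical chunking scheme)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def candidates(suspect, corpus, k=3, min_shared=1):
    """Stage 1 (assumed filter): keep documents sharing at least min_shared chunks."""
    suspect_chunks = word_chunks(suspect, k)
    return [
        name
        for name, text in corpus.items()
        if len(suspect_chunks & word_chunks(text, k)) >= min_shared
    ]


def matching_statistics(pattern, text):
    """Stage 2: ms[i] is the length of the longest prefix of pattern[i:] that occurs in text.

    This version uses naive substring search for clarity; a suffix tree built
    over text yields the same values in linear time.
    """
    ms = []
    length = 0
    for i in range(len(pattern)):
        # ms[i] is at least ms[i - 1] - 1, so resume from there instead of zero.
        length = max(length - 1, 0)
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms
```

Long runs of high values in the result of `matching_statistics` mark passages of the suspect document that also appear in a candidate. Because each candidate is compared independently, the comparisons could be spread across workers, which is one way to read the parallel and distributed discussion mentioned in the abstract.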