Models and Algorithms for Duplicate Document Detection

Authors:
Daniel P. Lopresti
Venue:
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Year:
1999

Cited by 6:

A Comparison of Text-Based Methods for Detecting Duplication in Scanned Document Databases
Information Retrieval

A text image enhancement system based on segmentation and classification methods
Proceedings of the 1st ACM workshop on Hardcopy document processing

Duplicate detection in click streams
WWW '05 Proceedings of the 14th international conference on World Wide Web

Beyond topical similarity: a structural similarity measure for retrieving highly similar documents
Knowledge and Information Systems

Improving web information indexing and retrieval based on center block duplication detection
International Journal of Innovative Computing and Applications

Query by document via a decomposition-based two-level retrieval approach
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Abstract

This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution derived from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data reflecting real-world degradation effects.
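
For readers unfamiliar with approximate string matching, the sketch below illustrates the general idea with the classic edit-distance (Levenshtein) dynamic program. It is not taken from the paper and does not reproduce any of its four models. The function names, the normalization by the longer string's length, and the `max_error_rate` threshold are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str, max_error_rate: float = 0.1) -> bool:
    """Hypothetical duplicate test: edit distance relative to the longer text.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty texts are trivially identical
    return edit_distance(a, b) / longest <= max_error_rate
```

For example, `is_near_duplicate(clean_text, ocr_text, max_error_rate=0.15)` would accept two texts that differ only by a few OCR-style character errors, where `clean_text` and `ocr_text` are placeholder variables. The quadratic cost of the full dynamic program makes this practical only for short texts or as a final verification step.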