A Comparison of Text-Based Methods for Detecting Duplication in Scanned Document Databases

Authors:
Daniel P. Lopresti
Affiliations:
Bell Labs, Lucent Technologies Inc., 600 Mountain Avenue, Room 2D-447, Murray Hill, NJ 07974, USA. dpl@research.bell-labs.com
Venue:
Information Retrieval
Year:
2001

Abstract

This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a “duplicate.” We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.