This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a “duplicate.” We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.
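
The abstract does not name the specific methods evaluated, so the following is only an illustrative sketch of one generic text-based approach to duplicate detection on noisy OCR output: comparing documents by the overlap of their character n-gram sets (Jaccard similarity). The function names, the choice of trigrams, and the sample strings are assumptions made for this example and are not taken from the paper.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, n=3):
    """Character n-gram Jaccard similarity between two documents (hypothetical helper)."""
    return jaccard(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


if __name__ == "__main__":
    clean = "the quick brown fox jumps over the lazy dog"
    # Simulated OCR output with a few character substitutions (illustrative only).
    noisy = "the qu1ck brown f0x jumps ovcr the lazy dog"
    unrelated = "information retrieval in document image databases"

    print("clean vs noisy:    ", round(similarity(clean, noisy), 3))
    print("clean vs unrelated:", round(similarity(clean, unrelated), 3))
```

Because a single misrecognized character disturbs only the few n-grams that contain it, a measure of this kind degrades gradually under OCR errors. Deciding which score counts as a "duplicate" still depends on the interpretation of duplication being used, which is the ambiguity the abstract points to.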