The Detection of Duplicates in Document Image Databases

Authors:
David S. Doermann; Huiping Li; Omid E. Kia
Venue:
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Year:
1997

Citing 0
Cited 10

Information Retrieval from Documents: A Survey
Information Retrieval

Comparison and Classification of Documents Based on Layout Similarity
Information Retrieval

A Comparison of Text-Based Methods for Detecting Duplication in Scanned Document Databases
Information Retrieval

Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing
IEEE Transactions on Pattern Analysis and Machine Intelligence

Duplicate detection in consumer photography and news video
Proceedings of the tenth ACM international conference on Multimedia

Document Image Recognition Based on Template Matching of Component Block Projections
IEEE Transactions on Pattern Analysis and Machine Intelligence

A Segmentation-free Approach for Keyword Search in Historical Typewritten Documents
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Accessing the content of Greek historical documents
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Robust image based document comparison using attributed relational graphs
SPPRA '08 Proceedings of the Fifth IASTED International Conference on Signal Processing, Pattern Recognition and Applications

Use of affine invariants in locally likely arrangement hashing for camera-based document image retrieval
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems

Abstract

We propose and implement a method for detecting duplicate documents in very large image databases. The method extracts a robust "signature" from each document image and uses it to index into a table of previously processed documents. Compared with OCR and other recognition-based methods, the approach offers advantages that include speed and robustness to imaging distortions. To justify the approach and test its scalability, we developed a simulator that lets us vary system parameters and examine performance on millions of document signatures. A complete system is implemented and tested on a collection of technical articles and memos.
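
The abstract does not say how the signature is computed or how the table is organized, so the following is only a minimal sketch of the general idea of signature-based lookup, not the authors' method. It assumes a signature is a short sequence of small integers already extracted from a page image (the extraction step is out of scope here), and it uses overlapping n-grams of that sequence as table keys so that a few corrupted symbols do not prevent a match. The names `make_keys`, `SignatureIndex`, and the parameters `n` and `min_fraction` are hypothetical.

```python
from collections import defaultdict


def make_keys(signature, n=3):
    """Split a signature (a sequence of small ints) into overlapping n-grams.

    Assumption: using n-grams as keys is an illustrative choice, not
    something stated in the abstract.
    """
    return [tuple(signature[i:i + n]) for i in range(len(signature) - n + 1)]


class SignatureIndex:
    """Table mapping signature n-grams to the documents that contain them."""

    def __init__(self, n=3, min_fraction=0.6):
        self.n = n
        # Fraction of query keys a stored document must share to be reported.
        self.min_fraction = min_fraction
        self.table = defaultdict(set)

    def add(self, doc_id, signature):
        """Register a previously processed document under each of its keys."""
        for key in make_keys(signature, self.n):
            self.table[key].add(doc_id)

    def query(self, signature):
        """Return the sorted IDs of stored documents that share enough keys."""
        keys = make_keys(signature, self.n)
        votes = defaultdict(int)
        for key in keys:
            # .get avoids creating empty entries in the defaultdict.
            for doc_id in self.table.get(key, ()):
                votes[doc_id] += 1
        needed = self.min_fraction * len(keys)
        return sorted(d for d, v in votes.items() if v >= needed)
```

A possible usage, with made-up signatures: after `index.add("memo-1", [4, 7, 2, 9, 3, 5, 8, 1])`, querying with a copy whose last symbol has been corrupted, `index.query([4, 7, 2, 9, 3, 5, 8, 6])`, still shares five of its six keys with the stored document and so reports it as a candidate duplicate. The voting threshold trades off tolerance to distortion against false matches; a real system would tune it empirically.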