Indexing and retrieval of words in old documents

Authors:
Simone Marinai; Emanuele Marino; Giovanni Soda
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Year:
2003

Abstract

This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering and string encoding for efficient string matching. During indexing, a self-organizing map (SOM) is trained to cluster similar symbols (character-like objects) in a subset of the documents to be stored. Using the trained SOM, the words in the whole collection can be stored and represented with a fixed-length description that can be easily compared in order to score the most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are the processing of old documents (18th and 19th centuries), where current OCR systems have more difficulty. Experimental results describe three application scenarios with various levels of difficulty for current OCR systems.
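
The pipeline outlined in the abstract (cluster symbols with a SOM, encode each word as a fixed-length description, rank stored words against a query) can be illustrated with a minimal sketch. It is not the authors' implementation. The 2-D toy features in `SHAPES`, the 1-D chain of four units, the resampling used to obtain a fixed length, and the distance function are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random

def nearest_unit(weights, x):
    """Return the index of the SOM unit whose weights are closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(samples, n_units=4, steps=400, seed=0):
    """Train a tiny 1-D SOM (a chain of units) on symbol feature vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for step in range(steps):
        frac = 1.0 - step / steps  # decays from 1 towards 0
        x = rng.choice(samples)
        bmu = nearest_unit(weights, x)
        for i in range(n_units):
            gap = abs(i - bmu)
            if gap > 1:
                continue  # only the winner and its chain neighbours move
            rate = 0.5 * frac * (1.0 if gap == 0 else 0.5 * frac)
            weights[i] = [w + rate * (v - w) for w, v in zip(weights[i], x)]
    return weights

def encode_word(weights, symbols, slots=6):
    """Map each symbol to its SOM unit, then resample to a fixed number of slots.

    The resampling step is an assumption; it only serves to make codes comparable.
    """
    units = [nearest_unit(weights, s) for s in symbols]
    return [units[j * len(units) // slots] for j in range(slots)]

def code_distance(a, b):
    """Sum of per-slot gaps between unit indices (0 means identical codes)."""
    return sum(abs(p - q) for p, q in zip(a, b))

def search(index, query_code):
    """Return (distance, word) pairs, most similar first."""
    return sorted((code_distance(code, query_code), word)
                  for word, code in index.items())

# Hypothetical stand-in for symbol features: each letter gets a 2-D vector.
SHAPES = {"a": (0.1, 0.1), "b": (0.9, 0.1), "c": (0.1, 0.9), "d": (0.9, 0.9)}

def features(word):
    return [SHAPES[ch] for ch in word]

som = train_som([SHAPES[ch] for ch in "abcd"])
index = {w: encode_word(som, features(w)) for w in ["abba", "dcab", "cada", "bbbd"]}
results = search(index, encode_word(som, features("abba")))
print(results)
```

In a real setting the feature vectors would be computed from segmented character-like objects in the page images, and the query would be encoded with the same trained map, so no OCR step is needed at either end.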