Indexing and retrieval of words in old documents

  • Authors:
  • Simone Marinai; Emanuele Marino; Giovanni Soda

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year:
  • 2003

Abstract

This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self-organizing map (SOM) is trained to cluster together similar symbols (character-like objects) in a subset of the documents to be stored. Using the trained SOM, the words in the whole collection can be stored and represented with a fixed-length description, which can be easily compared in order to score the most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are in the processing of old documents (18th and 19th centuries), where current OCR systems have more difficulty. Experimental results describe three application scenarios with various levels of difficulty for current OCR systems.
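
The pipeline outlined in the abstract (cluster symbols with a SOM, encode each word as a fixed-length description, rank words by similarity to a query) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`train_som`, `encode_word`, `rank_words`), the map size, the learning schedule, the resampling used to obtain a fixed length, and the Euclidean ranking are all assumptions made for illustration, and symbol feature vectors are taken as already extracted from the page images.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM on symbol feature vectors (one row per symbol).

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    # Position of every unit on the 2-D map grid.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighbourhood
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights, grid

def encode_word(symbols, weights, grid, slots=8):
    """Encode a non-empty word (array of symbol vectors) as a fixed-length vector.

    Each symbol is replaced by the grid coordinates of its best-matching unit;
    the sequence is then resampled to `slots` positions (an assumed scheme).
    """
    bmus = [int(np.argmin(np.sum((weights - s) ** 2, axis=1))) for s in symbols]
    coords = grid[bmus]
    idx = np.floor(np.arange(slots) * len(symbols) / slots).astype(int)
    return coords[idx].ravel()

def rank_words(query_vec, index_vecs):
    """Return indices of indexed words, most similar to the query first."""
    return np.argsort(np.linalg.norm(index_vecs - query_vec, axis=1))
```

Because every word is reduced to a vector of the same length, a query can be compared against the whole index with a single vectorized distance computation, whatever the number of symbols in each word.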