Text retrieval from early printed books

  • Authors: Simone Marinai
  • Affiliations: Università di Firenze, Dipartimento di Sistemi e Informatica, Firenze, Italy
  • Venue: International Journal on Document Analysis and Recognition - Special issue on noisy text analytics
  • Year: 2011

Abstract

Retrieving text from early printed books is particularly difficult because words in these documents are set very close to one another and, as in medieval manuscripts, ligatures and abbreviations are used extensively. To address these problems, we propose a word indexing and retrieval technique that does not require word segmentation and is tolerant of errors in character segmentation. Two main principles characterize the approach. First, characters are identified in the pages and clustered with a self-organizing map (SOM). During retrieval, the similarity of characters is estimated from the proximity of cluster centroids in the SOM space, rather than by directly comparing the character images. Second, query words are matched against the indexed sequence of characters by means of a dynamic time warping (DTW)-based approach. The proposed technique integrates the SOM similarity and information about character widths into the string matching process. The best path in the DTW array is identified by considering the widths of matching words relative to the query, so as to deal with broken or touching symbols. The proposed method is tested on four copies of the Gutenberg Bible.
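
The matching idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the representation of each character as a tuple (som_row, som_col, width), the names local_cost and find_word, the weights width_weight and span_weight, and the exact way widths enter the cost are all assumptions made for illustration. The sketch uses a subsequence variant of DTW, so the query can start and end anywhere in the indexed character sequence without prior word segmentation.

```python
import math

def local_cost(q, t, width_weight=0.5):
    """Cost of aligning a query character with an indexed character.

    Each character is a tuple (som_row, som_col, width). The first term is
    the distance between the two SOM cells; the second is a relative width
    difference. Both terms and their weighting are illustrative assumptions.
    """
    som_dist = math.hypot(q[0] - t[0], q[1] - t[1])
    width_diff = abs(q[2] - t[2]) / max(q[2], t[2])
    return som_dist + width_weight * width_diff

def find_word(query, text, width_weight=0.5, span_weight=1.0):
    """Subsequence DTW: return (score, start, end) of the best match.

    start and end are inclusive indices into text.
    """
    m, n = len(query), len(text)
    inf = float("inf")
    # D[i][j]: minimum cost of aligning query[:i] with a run ending at text[j-1]
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: index in text where the run behind D[i][j] begins
    start = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        D[0][j] = 0.0          # a match may begin at any position for free
        start[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = local_cost(query[i - 1], text[j - 1], width_weight)
            # Diagonal step: one query character matches one indexed character.
            # Horizontal step: one query character spans several indexed
            # pieces (a broken symbol).
            cands = [(D[i - 1][j - 1], start[i - 1][j - 1]),
                     (D[i][j - 1], start[i][j - 1])]
            if i > 1:
                # Vertical step: several query characters share one indexed
                # symbol (touching symbols).
                cands.append((D[i - 1][j], start[i - 1][j]))
            prev_cost, prev_start = min(cands)
            D[i][j] = prev_cost + cost
            start[i][j] = prev_start
    # Choose the end point, penalising spans whose total width differs
    # from the total width of the query.
    q_width = sum(c[2] for c in query)
    prefix = [0]
    for c in text:
        prefix.append(prefix[-1] + c[2])
    best = (inf, -1, -1)
    for j in range(1, n + 1):
        s = start[m][j]
        span_width = prefix[j] - prefix[s]
        score = D[m][j] + span_weight * abs(span_width - q_width) / q_width
        if score < best[0]:
            best = (score, s, j - 1)
    return best

# Usage: the middle query character is broken into two narrow pieces in text.
query = [(1, 2, 8), (3, 1, 14), (4, 4, 9)]
text = [(0, 0, 10), (1, 2, 8), (3, 1, 7), (3, 1, 7), (4, 4, 9), (0, 5, 11)]
score, first, last = find_word(query, text)
print(first, last, round(score, 3))
```

In this toy example the two narrow pieces are absorbed by a horizontal step, and the width of the matched span equals the width of the query, so the span penalty does not disturb the match. A real system would derive the SOM coordinates from clustered character images rather than from hand-written tuples.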