Text retrieval from early printed books

Authors:
Simone Marinai
Affiliations:
University of Firenze, Firenze, Italy
Venue:
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year:
2009

Abstract

We describe a text indexing and retrieval technique that does not rely on word segmentation and is tolerant of errors in character segmentation. The method is designed to process early printed documents, and we evaluate it on the well-known Latin Gutenberg Bible. The approach relies on two main components. First, character objects (in most cases corresponding to individual characters) are extracted from the document and clustered, so that a symbolic class is assigned to each indexed object. Second, a query word is compared against the indexed character objects with an approach based on Dynamic Time Warping (DTW). The distinctive feature of the matching technique described in this paper is that it incorporates sub-symbolic information into the string matching process. In particular, we take into account the estimated widths of potential subwords, which are computed by accumulating the lengths of partial matches in the DTW array.
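
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the cost function, so the 0/1 label cost, the `width_weight` penalty, and the names `CharObject` and `spot_word` are assumptions made for illustration. The sketch treats a text line as a sequence of (cluster label, width) pairs, runs a subsequence DTW so that no word segmentation is needed, and accumulates the widths of partial matches in arrays kept alongside the DTW cost array.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharObject:
    label: int    # cluster id of the object (hypothetical encoding)
    width: float  # object width, e.g. in pixels


def spot_word(query, line, width_weight=0.05):
    """Find the span of `line` that best matches `query`.

    Returns (cost, start, end) so that line[start:end] is the match.
    The cost function is an illustrative assumption, not the paper's.
    """
    if not query or not line:
        raise ValueError("query and line must be non-empty")
    n, m = len(query), len(line)
    inf = float("inf")
    # cost[i][j]: cost of aligning query[:i] with a span ending at line[j-1]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # width[i][j]: accumulated width of the line objects in that partial match
    width = [[0.0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in `line` where that partial match begins
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # a match may begin at any position in the line
        start[0][j] = j

    expected = 0.0  # expected width of the query prefix query[:i]
    for i in range(1, n + 1):
        expected += query[i - 1].width
        for j in range(1, m + 1):
            obj = line[j - 1]
            label_cost = 0.0 if query[i - 1].label == obj.label else 1.0
            # (previous cost, accumulated width, start index) per DTW move
            moves = [
                # one query character consumes one new line object
                (cost[i - 1][j - 1], width[i - 1][j - 1] + obj.width,
                 start[i - 1][j - 1]),
                # same query character absorbs another object (over-segmentation)
                (cost[i][j - 1], width[i][j - 1] + obj.width, start[i][j - 1]),
            ]
            if i > 1:
                # two query characters share one object (merged characters)
                moves.append((cost[i - 1][j], width[i - 1][j], start[i - 1][j]))
            best = None
            for prev, w, s in moves:
                # penalize partial matches whose width drifts from the expected one
                total = prev + label_cost + width_weight * abs(w - expected)
                if best is None or total < best[0]:
                    best = (total, w, s)
            cost[i][j], width[i][j], start[i][j] = best

    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], start[n][end], end


# Usage: labels and widths below are made-up values for illustration.
line = [CharObject(l, 10.0) for l in (3, 1, 4, 1, 5, 9, 2, 6)]
query = [CharObject(l, 10.0) for l in (4, 1, 5)]
result = spot_word(query, line)  # best span is line[2:5]
```

Because the width penalty depends on the path taken through the array, keeping only the cheapest move per cell is a heuristic rather than an exact optimization. The example also assumes the query word has already been converted into cluster labels with estimated widths; how that conversion is done is outside the scope of this sketch.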