Font Adaptive Word Indexing of Modern Printed Documents

Authors:
Simone Marinai;Emanuele Marino;Giovanni Soda
Affiliations:
-;-;IEEE Computer Society
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2006

Citing 18

Evaluation of model-based retrieval effectiveness with OCR text. ACM Transactions on Information Systems (TOIS)
A Survey of Methods and Strategies in Character Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence
INFORMys: A Flexible Invoice-Like Form-Reader System. IEEE Transactions on Pattern Analysis and Machine Intelligence
The indexing and retrieval of document images: a survey. Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Managing gigabytes (2nd ed.): compressing and indexing documents and images
The Role of Holistic Paradigms in Handwritten Word Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
Modern Information Retrieval
Self-Organizing Maps
Imaged Document Text Retrieval Without OCR. IEEE Transactions on Pattern Analysis and Machine Intelligence
Information Retrieval from Documents: A Survey. Information Retrieval
Word Spotting in Bitmapped Fax Documents. Information Retrieval
Indexing and retrieval of words in old documents. ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
A search engine for historical manuscript images. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval in Document Image Databases. IEEE Transactions on Knowledge and Data Engineering
Artificial Neural Networks for Document Analysis and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
Exact indexing of dynamic time warping. Knowledge and Information Systems
Layout based document image retrieval by means of XY tree reduction. ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Eigenspace Method for Text Retrieval in Historical Document Images. ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Cited 8

Embedded Map Projection for Dimensionality Reduction-Based Similarity Search. SSPR & SPR '08 Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition
Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognition
Text retrieval from early printed books. Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Efficient Language-Independent Retrieval of Printed Documents without OCR. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Nonlinear Embedded Map Projection for Dimensionality Reduction. ICIAP '09 Proceedings of the 15th International Conference on Image Analysis and Processing
Word spotting in historical printed documents using shape and sequence comparisons. Pattern Recognition
Enabling search over large collections of Telugu document images – an automatic annotation based approach. ICVGIP'06 Proceedings of the 5th Indian conference on Computer Vision, Graphics and Image Processing
Exploring digital libraries with document image retrieval. ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries

Quantified Score

Hi-index	0.14

Abstract

We propose an approach for the word-level indexing of modern printed documents that are difficult to recognize using current OCR engines. Word-level indexing makes it possible to retrieve the position of words in a document, which enables queries involving the proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Digital libraries, by contrast, hold collections of digitized documents that can be retrieved only by browsing the document images or by relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self-Organizing Maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are the processing of modern printed documents (17th to 19th centuries), where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
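
The role of word positions in proximity queries can be illustrated with a minimal sketch. It is not the system described in the abstract: it assumes the words are already available as plain strings, whereas the proposed approach indexes word images through a vector-based representation. The names `build_positional_index` and `proximity_query`, the sample pages, and the `max_gap` parameter are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (hypothetical helper).

    `docs` maps a document id to its list of words in reading order.
    Plain strings stand in for the word-image representation here.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in docs.items():
        for pos, word in enumerate(words):
            index[word][doc_id].append(pos)
    return index


def proximity_query(index, term_a, term_b, max_gap):
    """Return ids of documents where both terms occur within max_gap words."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents that contain both terms can match.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(
            abs(pa - pb) <= max_gap
            for pa in postings_a[doc_id]
            for pb in postings_b[doc_id]
        ):
            hits.append(doc_id)
    return hits


# Illustrative data only, not taken from the paper's data sets.
docs = {
    "page1": "font adaptive word indexing of printed documents".split(),
    "page2": "word clusters and printed fonts without indexing".split(),
}
index = build_positional_index(docs)
near = proximity_query(index, "word", "indexing", max_gap=1)
far = proximity_query(index, "word", "indexing", max_gap=6)
```

Because each posting stores positions rather than a bare document id, the same index answers both plain term lookups and proximity constraints; a document-level index could not distinguish `page1` from `page2` in the tighter query.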