Enabling search over large collections of Telugu document images – an automatic annotation based approach

Authors:
K. Pramod Sankar; C. V. Jawahar
Affiliations:
Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India; Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
Venue:
ICVGIP'06 Proceedings of the 5th Indian conference on Computer Vision, Graphics and Image Processing
Year:
2006

Abstract

For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: i) Indian language document images are made searchable by textual queries, ii) interactive content-level access is provided to document images for search and retrieval, iii) a novel recognition-free approach, which does not require OCR, is adapted and validated, iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process, and v) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages. Character-recognition-based approaches yield poor results when building search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, based on word spotting, are not directly scalable to large collections, due to the computational cost of matching images in the feature space. For example, if it takes 1 ms to match two images, retrieving documents for a single query from a large collection like ours would require close to a day. In this paper we propose a novel automatic-annotation-based approach that provides textual descriptions of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over the annotated images. This system has an interactive response time of about 0.01 seconds. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers over about a month. Our procedure is highly automatic, requiring minimal human intervention.
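
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the image identifiers, and the romanized labels are hypothetical, and a plain dictionary stands in for whatever text index the real system uses. The first function only multiplies the two figures quoted in the abstract (21 million word images, 1 ms per match). A single pass under those numbers alone comes to several hours; the abstract's estimate of close to a day is the authors' own and may include costs not itemized here. The remaining functions show why a query becomes a lookup rather than a scan once each word image carries a text annotation.

```python
from collections import defaultdict

WORD_IMAGES = 21_000_000  # collection size quoted in the abstract
MATCH_MS = 1              # per-pair matching cost assumed in the abstract


def linear_scan_hours(num_images, match_ms):
    """Hours needed to match one query image against every word image once."""
    return num_images * match_ms / 1000 / 3600


def build_index(annotations):
    """Build a text index from (image_id, label) pairs.

    The labels stand for the output of the offline annotation step;
    this sketch takes them as given and does not compute them.
    """
    index = defaultdict(list)
    for image_id, label in annotations:
        index[label].append(image_id)
    return index


def search(index, query):
    """Return the word images annotated with the query text (a dictionary lookup)."""
    return index.get(query, [])


# Hypothetical annotated word images: (book/page/word identifier, text label).
annotations = [
    ("book1/p3/w17", "hyderabad"),
    ("book1/p9/w2", "telugu"),
    ("book2/p1/w40", "hyderabad"),
]
index = build_index(annotations)

print(f"one linear scan: {linear_scan_hours(WORD_IMAGES, MATCH_MS):.2f} hours")
print(search(index, "hyderabad"))
```

The lookup cost does not grow with the number of image comparisons, which is consistent with the interactive response time the abstract reports; the expensive image matching is moved entirely into the offline annotation step.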