Content-level Annotation of Large Collection of Printed Document Images

Authors:
A. Kumar; C. V. Jawahar
Affiliations:
International Institute of Information Technology, Hyderabad - 500032, INDIA; International Institute of Information Technology, Hyderabad - 500032, INDIA
Venue:
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
Year:
2007

Abstract

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating a large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collections of documents, scanned at three different resolutions, at the character level. We employ an XML representation to store the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
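
As a rough illustration of what content-level access to a hierarchical XML annotation might look like, the sketch below parses a small annotation file and lists its words and characters with their bounding boxes. The abstract does not give the schema or the API, so every element name, attribute name (`page`, `line`, `word`, `char`, `bbox`, `text`, `dpi`) and the helper `iter_units` are assumptions made purely for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation: element and attribute names are assumptions
# for illustration, not the format used in the paper.
SAMPLE = """
<page image="page_0001.png" dpi="300">
  <line bbox="10 20 200 50">
    <word bbox="10 20 90 50" text="ab">
      <char bbox="10 20 45 50" text="a"/>
      <char bbox="50 20 90 50" text="b"/>
    </word>
    <word bbox="110 20 200 50" text="c">
      <char bbox="110 20 200 50" text="c"/>
    </word>
  </line>
</page>
"""


def parse_bbox(value):
    """Convert an 'x1 y1 x2 y2' string into a tuple of four ints."""
    x1, y1, x2, y2 = (int(v) for v in value.split())
    return (x1, y1, x2, y2)


def iter_units(root, level):
    """Yield (text, bbox) for every element at the given level, in document order.

    `level` is a tag name in the hypothetical schema, e.g. 'word' or 'char'.
    """
    for elem in root.iter(level):
        yield elem.get("text"), parse_bbox(elem.get("bbox"))


root = ET.fromstring(SAMPLE.strip())
words = list(iter_units(root, "word"))
chars = list(iter_units(root, "char"))

for text, bbox in words:
    print("word", text, bbox)
for text, bbox in chars:
    print("char", text, bbox)
```

Keeping the hierarchy in the XML itself means the same file can serve word-level and character-level training or evaluation, with the level chosen at query time.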