Content-level Annotation of Large Collection of Printed Document Images

  • Authors:
  • A. Kumar;C. V. Jawahar

  • Affiliations:
  • International Institute of Information Technology, Hyderabad - 500032, INDIA;International Institute of Information Technology, Hyderabad - 500032, INDIA

  • Venue:
  • ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
  • Year:
  • 2007

Abstract

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating a large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collections of documents, scanned at three different resolutions, at the character level. We employ an XML representation for storing the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
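
The abstract does not give the XML schema or the API, so the following is only a minimal sketch of what hierarchical, content-level access could look like. The element names (`page`, `line`, `word`, `char`), the `bbox` and `text` attributes, the sample file name, and the `annotations` helper are all assumptions made for illustration and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file: the schema below is an illustrative
# assumption, not the representation used by the authors.
SAMPLE = """
<page image="page_001.png" dpi="300">
  <line bbox="10 10 200 40">
    <word bbox="10 10 90 40" text="ab">
      <char bbox="10 10 50 40" text="a"/>
      <char bbox="50 10 90 40" text="b"/>
    </word>
    <word bbox="100 10 140 40" text="c">
      <char bbox="100 10 140 40" text="c"/>
    </word>
  </line>
</page>
"""


def parse_bbox(value):
    """Convert a 'x1 y1 x2 y2' attribute string into a tuple of ints."""
    return tuple(int(v) for v in value.split())


def annotations(root, level):
    """Yield (text, bbox) for every element at the given level.

    `level` is the (assumed) tag name, e.g. "word" or "char".
    """
    for node in root.iter(level):
        yield node.get("text"), parse_bbox(node.get("bbox"))


root = ET.fromstring(SAMPLE.strip())
words = list(annotations(root, "word"))
chars = list(annotations(root, "char"))
```

Under this assumed layout, an OCR training script could request `chars` to get character images with their labels, while a word-spotting evaluation could request `words` from the same file.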