Managing multilingual OCR project using XML
Proceedings of the International Workshop on Multilingual OCR
Experiences of integration and performance testing of multilingual OCR for printed Indian scripts
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Automatic localization of page segmentation errors
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Content level access to digital library of India pages
Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing
Automatic localization and correction of line segmentation errors
Proceeding of the workshop on Document Analysis and Recognition
Hi-index | 0.00 |
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is la- borious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed doc- ument images. We align document images with indepen- dently keyed-in text. The method is model-driven and is in- tended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation infor- mation. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other doc- ument understanding tasks.