Adaptive Hindi OCR using generalized Hausdorff image comparison

Authors:
Huanfeng Ma; David Doermann
Affiliations:
University of Maryland, College Park, MD; University of Maryland, College Park, MD
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2003

Abstract

We present an adaptive Hindi OCR system implemented as part of a rapidly retargetable language tool effort. The system comprises four components: script identification, character segmentation, training sample creation, and character recognition. In script identification, Hindi words are identified in bilingual or multilingual documents, either from features of the Devanagari script or with Support Vector Machines. The identified words are then segmented into individual characters; composite characters are detected and further segmented based on the structural properties of the script and statistical information. Segmented characters are recognized using generalized Hausdorff image comparison (GHIC), and postprocessing is applied to improve performance. The OCR system, which was designed and implemented in one month, was applied to a complete Hindi-English bilingual dictionary and to a set of ideal images extracted from Hindi documents in PDF format. Experimental results show that recognition accuracy can reach 88% for noisy images and 95% for ideal images. The method can also be extended to build OCR systems for other scripts.
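
The abstract does not define GHIC, so the following is only a minimal sketch of the general idea behind the recognition step. It assumes a rank-based (partial) Hausdorff distance between the foreground pixels of two binary character images, where the largest nearest-neighbor distances are discarded so that a few noise pixels do not dominate the score. The function names, the `fraction` parameter, and the nearest-template classifier are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def foreground_points(image):
    """Return the (row, col) coordinates of nonzero pixels as floats."""
    return np.argwhere(np.asarray(image) > 0).astype(float)


def directed_partial_hausdorff(a, b, fraction=1.0):
    """Ranked directed distance from point set a to point set b.

    For each point in a, take the distance to its nearest point in b,
    sort those distances, and return the value at rank
    ceil(fraction * len(a)). With fraction=1.0 this is the classical
    directed Hausdorff distance (the maximum).
    """
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both images need at least one foreground pixel")
    # Pairwise Euclidean distances via broadcasting: shape (len(a), len(b)).
    diffs = a[:, None, :] - b[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.sort(dists.min(axis=1))
    k = max(1, int(np.ceil(fraction * len(nearest))))
    return float(nearest[k - 1])


def partial_hausdorff(image_a, image_b, fraction=1.0):
    """Symmetric ranked Hausdorff distance between two binary images."""
    a = foreground_points(image_a)
    b = foreground_points(image_b)
    return max(
        directed_partial_hausdorff(a, b, fraction),
        directed_partial_hausdorff(b, a, fraction),
    )


def classify(glyph, templates, fraction=0.9):
    """Return the label of the template closest to the glyph.

    `templates` is a hypothetical dict mapping labels to binary images
    of the same size as `glyph`.
    """
    return min(
        templates,
        key=lambda label: partial_hausdorff(glyph, templates[label], fraction),
    )
```

With `fraction` below 1.0, an isolated speck far from the character stroke is ignored by the ranking, which is one plausible reason a Hausdorff-style measure suits noisy scans. The brute-force pairwise distance matrix is only practical for small glyph images; a distance transform would be the usual choice at scale.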