We present an adaptive Hindi OCR system implemented as part of a rapidly retargetable language tool effort. The system comprises four stages: script identification, character segmentation, training sample creation, and character recognition. In script identification, Hindi words are identified in bilingual or multilingual documents either from features of the Devanagari script or with Support Vector Machines. The identified words are then segmented into individual characters; composite characters are detected and further segmented using the structural properties of the script and statistical information. Segmented characters are recognized using generalized Hausdorff image comparison (GHIC), and postprocessing is applied to improve performance. The OCR system, which was designed and implemented in one month, was applied to a complete Hindi-English bilingual dictionary and to a set of ideal images extracted from Hindi documents in PDF format. Experimental results show that recognition accuracy can reach 88% for noisy images and 95% for ideal images. The presented method can also be extended to design OCR systems for other scripts.
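The abstract does not spell out how GHIC is computed, so the following is only a minimal sketch of the general idea behind Hausdorff-style image comparison: a ranked (partial) Hausdorff distance between the foreground pixels of two binary images, used for nearest-template matching. The function names, the `fraction` parameter, and the toy templates are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def foreground_points(image):
    """Return (row, col) coordinates of the nonzero pixels of a 2-D list."""
    return [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value]


def directed_partial_hausdorff(a, b, fraction=1.0):
    """Ranked directed distance from point set a to point set b.

    fraction=1.0 gives the classical directed Hausdorff distance (the
    largest nearest-neighbour distance). A smaller fraction takes a lower
    rank instead, so the worst-matching points (e.g. noise) are ignored.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Distance from every point of a to its nearest point of b, ascending.
    nearest = sorted(min(math.dist(p, q) for q in b) for p in a)
    k = max(1, math.ceil(fraction * len(nearest)))
    return nearest[k - 1]


def partial_hausdorff(a, b, fraction=1.0):
    """Symmetric version: the larger of the two directed distances."""
    return max(directed_partial_hausdorff(a, b, fraction),
               directed_partial_hausdorff(b, a, fraction))


def classify(sample, templates, fraction=0.8):
    """Return the label of the template closest to the sample image.

    `templates` is a hypothetical dict mapping labels to binary images.
    """
    points = foreground_points(sample)
    return min(
        templates,
        key=lambda label: partial_hausdorff(
            points, foreground_points(templates[label]), fraction),
    )


if __name__ == "__main__":
    # Toy 5x5 glyphs standing in for segmented character images.
    vertical = [[0, 0, 1, 0, 0] for _ in range(5)]
    horizontal = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
    noisy = [row[:] for row in vertical]
    noisy[0][4] = 1  # one stray pixel of scanner noise
    templates = {"vertical": vertical, "horizontal": horizontal}
    print(classify(noisy, templates))
```

Taking a rank below the maximum is what makes this kind of comparison tolerant of isolated noise pixels, which is relevant to the gap the abstract reports between noisy and ideal images. This brute-force version is quadratic in the number of foreground pixels; a practical system would more likely precompute a distance transform for each template.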