A post-processing scheme for Malayalam using statistical sub-character language models

Authors:
Karthika Mohan;C. V. Jawahar
Affiliations:
Centre for Visual Information Technology, IIIT-Hyderabad, India;Centre for Visual Information Technology, IIIT-Hyderabad, India
Venue:
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Year:
2010

Abstract

Most Indian scripts do not have robust commercial OCRs. Many laboratory prototypes report reasonable results at the recognition/classification stage. However, word-level accuracies are still poor. It is well known that word accuracy decreases as the number of characters in a word increases. For Malayalam, the average number of characters in a word is almost twice that of English. Moreover, the number of words required to cover 80% of the Malayalam language is more than forty times that of other Indian languages such as Hindi. Hence, a direct dictionary-based post-processing scheme is not suitable for Malayalam. In this paper, we propose a post-processing scheme that uses statistical language models at the sub-character level to boost word-level recognition results. We use a multi-stage graph representation and formulate the recognition task as an optimization problem. Edges of the graph encode the language information, and nodes represent the visual similarities. An optimal path from the source node to the destination node represents the recognized text. We validate our method on more than 10,000 words from a Malayalam corpus.
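
The multi-stage graph formulation can be illustrated with a minimal sketch. The abstract does not give the cost function or the order of the language model, so everything below is an assumption made for illustration: each stage holds candidate sub-character symbols with a classifier confidence (the node score), edges are weighted by a hypothetical bigram table, scores are combined as a sum of logarithms, and unseen bigrams receive a small floor probability. The Latin symbols stand in for Malayalam sub-character units, and the function name `best_path` is not from the paper.

```python
import math

START, END = "<s>", "</s>"  # hypothetical source and destination nodes


def best_path(stages, bigram, floor=1e-6):
    """Find the highest-scoring path through a multi-stage graph.

    stages: list of stages; each stage is a list of (symbol, visual_score)
            candidates, with visual_score in (0, 1].
    bigram: dict mapping (previous_symbol, symbol) to a probability.
    floor:  assumed probability for unseen bigrams and zero scores.
    Returns (list_of_symbols, log_score).
    """
    # best[symbol] = (log score of the best path ending at symbol, that path)
    best = {START: (0.0, [])}
    for candidates in stages:
        if not candidates:
            raise ValueError("every stage needs at least one candidate")
        nxt = {}
        for sym, visual in candidates:
            node = math.log(max(visual, floor))  # node: visual similarity
            for prev, (score, path) in best.items():
                # edge: language-model score for the transition prev -> sym
                edge = math.log(bigram.get((prev, sym), floor))
                total = score + edge + node
                if sym not in nxt or total > nxt[sym][0]:
                    nxt[sym] = (total, path + [sym])
        best = nxt
    # Close every surviving path with an edge into the destination node.
    score, path = max(
        (s + math.log(bigram.get((prev, END), floor)), p)
        for prev, (s, p) in best.items()
    )
    return path, score


# Toy input: the classifier slightly prefers "b" in the second position,
# but the assumed bigram table strongly favours "a" followed by "d".
stages = [[("a", 0.9)], [("b", 0.55), ("d", 0.45)], [("c", 0.8)]]
bigram = {
    (START, "a"): 0.5,
    ("a", "b"): 0.05,
    ("a", "d"): 0.6,
    ("b", "c"): 0.3,
    ("d", "c"): 0.3,
    ("c", END): 0.4,
}
path, log_score = best_path(stages, bigram)
```

On this toy input the selected path is `a`, `d`, `c`: the language-model edge outweighs the small visual preference for `b`, which is the kind of correction a post-processor of this type is meant to make. Because only the best path into each node is kept at every stage, the search is linear in the number of stages rather than exponential in word length.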