An Omnifont Open-Vocabulary OCR System for English and Arabic

Authors:
Issam Bazzi; Richard Schwartz; John Makhoul
Affiliations:
GTE Internetworking, Cambridge, MA; GTE Internetworking, Cambridge, MA; GTE Internetworking, Cambridge, MA
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
1999

Abstract

We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on hidden Markov models (HMMs), an approach that has proven very successful in automatic speech recognition. In this paper, we focus on two aspects of the OCR system. First, we address how to perform OCR on omnifont and multi-style data, such as plain and italic, without needing a separate model for each style. The amount of training data from each style that is used to train a single model becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
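
The abstract mentions a trigram language model on character sequences but gives no detail about how it is estimated. The following is a minimal sketch of what such a model could look like, not the authors' implementation. The class name `CharTrigramLM`, the padding symbols, and the add-k smoothing are all assumptions made for illustration; the abstract does not say which smoothing or estimation method the system uses.

```python
import math
from collections import defaultdict

# Hypothetical padding symbols; the abstract does not specify any.
BOS = "\x02"  # start-of-line marker, used only as context
EOS = "\x03"  # end-of-line marker, predicted as the last symbol


class CharTrigramLM:
    """Character trigram model with add-k smoothing (an assumed choice)."""

    def __init__(self, k=0.1):
        self.k = k
        self.trigram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocab = {EOS}  # symbols the model can predict

    def _pad(self, line):
        # Two start markers so the first character has a full two-symbol context.
        return [BOS, BOS] + list(line) + [EOS]

    def train(self, lines):
        for line in lines:
            chars = self._pad(line)
            self.vocab.update(line)
            for i in range(2, len(chars)):
                context = (chars[i - 2], chars[i - 1])
                self.trigram_counts[context + (chars[i],)] += 1
                self.context_counts[context] += 1

    def log_prob(self, char, context):
        # P(char | context) with add-k smoothing over the training vocabulary.
        num = self.trigram_counts.get(context + (char,), 0) + self.k
        den = self.context_counts.get(context, 0) + self.k * len(self.vocab)
        return math.log(num / den)

    def score(self, line):
        # Log-probability of the whole character sequence, including EOS.
        chars = self._pad(line)
        return sum(
            self.log_prob(chars[i], (chars[i - 2], chars[i - 1]))
            for i in range(2, len(chars))
        )


lm = CharTrigramLM(k=0.1)
lm.train(["the cat", "the hat", "that"])
likely = lm.score("the")
unlikely = lm.score("hte")
```

In a recognizer of this kind, a score like `lm.score(...)` would typically be combined with the HMM likelihood of each candidate character sequence, so that sequences resembling the training text are preferred without restricting the output to a fixed word list.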