Omnifont and Unlimited-Vocabulary OCR for English and Arabic

Authors:
Issam Bazzi; Chris LaPre; John Makhoul; Chris Raphael; Richard M. Schwartz
Venue:
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Year:
1997

Citing 0
Cited 10

An Omnifont Open-Vocabulary OCR System for English and Arabic
IEEE Transactions on Pattern Analysis and Machine Intelligence

Top-Down Likelihood Word Image Generation Model for Holistic Word Recognition
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Recognition of writer-independent off-line handwritten Arabic (Indian) numerals using hidden Markov models
Signal Processing

Recognition of off-line printed Arabic text using Hidden Markov Models
Signal Processing

A multiple feature/resolution scheme to Arabic (Indian) numerals recognition using hidden Markov models
Signal Processing

A novel minimal Arabic script for preparing databases and benchmarks for Arabic text recognition research
WAV'09 Proceedings of the 3rd WSEAS international symposium on Wavelets theory and applications in applied mathematics, signal processing & modern science

The use of radon transform in handwritten Arabic (Indian) numerals recognition
WSEAS Transactions on Computers

Recognition of Arabic (Indian) bank check digits using log-gabor filters
Applied Intelligence

Offline arabic handwritten text recognition: A Survey
ACM Computing Surveys (CSUR)

HMM-based script identification for OCR
Proceedings of the 4th International Workshop on Multilingual OCR

Abstract

We present a set of techniques for omnifont, unlimited-vocabulary OCR, within the context of a system based on Hidden Markov Models (HMMs). First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1% on data from the University of Washington English Document Image Database and 3.3% on data from the DARPA Arabic OCR Corpus.
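As an illustration of the character-level trigram language model mentioned in the abstract, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the model is estimated or smoothed, so the add-one smoothing, the boundary symbols `^` and `$`, and the function names `train_char_trigram` and `log_prob` are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Assumed boundary symbols (not from the paper): "^" pads the start, "$" marks the end.
START, END = "^", "$"


def train_char_trigram(texts):
    """Count character trigrams and their two-character contexts."""
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    alphabet = set()
    for text in texts:
        padded = START * 2 + text + END
        alphabet.update(padded)
        for i in range(2, len(padded)):
            context = padded[i - 2:i]
            trigram_counts[context + padded[i]] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts, alphabet


def log_prob(text, model):
    """Log-probability of a character string under the trigram model.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    trigrams from receiving zero probability.
    """
    trigram_counts, context_counts, alphabet = model
    vocab_size = len(alphabet)
    padded = START * 2 + text + END
    total = 0.0
    for i in range(2, len(padded)):
        context = padded[i - 2:i]
        numerator = trigram_counts.get(context + padded[i], 0) + 1
        denominator = context_counts.get(context, 0) + vocab_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    model = train_char_trigram(["the cat", "the hat", "that"])
    # A string made of frequent trigrams should score higher than one with a rare trigram.
    print(log_prob("the", model), log_prob("thq", model))
```

Because such a model scores arbitrary character sequences rather than entries in a fixed word list, it is one way a recognizer can rank hypotheses without restricting the vocabulary.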