Robust OCR of Degraded Documents
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
The optical character recognition of Urdu-like cursive scripts
Pattern Recognition
The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully tested the system with Arabic, English, and Chinese documents. In this paper, we describe our recent effort in training the system to perform recognition of documents in Pashto, one of the national languages of Afghanistan. We discuss the availability and characteristics of suitable experimental data and the methods we used to assemble Pashto training and test corpora. We modeled each form of each Pashto character with an HMM and tested the models on several varieties of document images. On a fair test set consisting of synthetic images alone we measured a character error rate of 1.6%. The character error rate on a fair test set consisting of scanned pages was 2.1%, and the character error rate on a fair test set of faxed pages was 3.1%. On other types of document images, character error rates increased in rough proportion to the level of degradation of the image.
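The abstract does not say exactly how the reported character error rates were scored. As a minimal sketch, assuming the common convention of dividing the Levenshtein edit distance (substitutions, insertions, and deletions) between the recognized text and the reference transcription by the reference length, the metric could be computed as follows. The function names `levenshtein` and `character_error_rate` are illustrative and are not taken from the BBN Byblos system.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character substitutions, insertions, and
    deletions needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1      # reference char missing from hypothesis
            insertion = current[j - 1] + 1  # extra char in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed convention)."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy English strings for illustration only; not data from the paper.
    print(character_error_rate("kitten", "sitting"))
```

Python strings are sequences of Unicode code points, so the same functions apply to Pashto text, although whether combining marks and contextual letter forms count as separate characters is a scoring decision this sketch leaves open.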