The BBN Byblos Pashto OCR system

Authors:
Michael Decerbo; Ehry MacRostie; Premkumar Natarajan
Affiliations:
BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA; BBN Technologies, Cambridge, MA
Venue:
Proceedings of the 1st ACM workshop on Hardcopy document processing
Year:
2004

Abstract

The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully tested the system with Arabic, English, and Chinese documents. In this paper, we describe our recent effort in training the system to perform recognition of documents in Pashto, one of the national languages of Afghanistan. We discuss the availability and characteristics of suitable experimental data and the methods we used to assemble Pashto training and test corpora. We modeled each form of each Pashto character with an HMM and tested the models on several varieties of document images. On a fair test set consisting of synthetic images alone we measured a character error rate of 1.6%. The character error rate on a fair test set consisting of scanned pages was 2.1%, and the character error rate on a fair test set of faxed pages was 3.1%. On other types of document images, character error rates increased in rough proportion to the level of degradation of the image.
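
The abstract reports character error rates but does not define the metric. A minimal sketch follows, assuming the common definition: the Levenshtein edit distance (substitutions, insertions, and deletions) between the recognizer output and the reference transcription, divided by the number of reference characters. The function names `edit_distance` and `character_error_rate` are illustrative and are not taken from the Byblos system, whose scoring procedure may differ in detail.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    # prev[j] holds the distance between the first i-1 reference characters
    # and the first j hypothesis characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # reaching an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a rate of 1.6% corresponds to about 16 character edits per 1,000 reference characters. Note that the sketch compares Unicode code points; scoring a script such as Pashto may first require normalizing the text, which is not shown here.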