The BBN Byblos Pashto OCR system

  • Authors: Michael Decerbo; Ehry MacRostie; Premkumar Natarajan
  • Affiliations: BBN Technologies, Cambridge, MA (all three authors)
  • Venue: Proceedings of the 1st ACM workshop on Hardcopy document processing
  • Year: 2004

Abstract

The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully tested the system with Arabic, English, and Chinese documents. In this paper, we describe our recent effort in training the system to perform recognition of documents in Pashto, one of the national languages of Afghanistan. We discuss the availability and characteristics of suitable experimental data and the methods we used to assemble Pashto training and test corpora. We modeled each form of each Pashto character with an HMM and tested the models on several varieties of document images. On a fair test set consisting of synthetic images alone we measured a character error rate of 1.6%. The character error rate on a fair test set consisting of scanned pages was 2.1%, and the character error rate on a fair test set of faxed pages was 3.1%. On other types of document images, character error rates increased in rough proportion to the level of degradation of the image.
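
The abstract reports results as character error rates but does not define the metric. A minimal sketch follows, assuming the common definition: the Levenshtein edit distance (substitutions, insertions, deletions) between the recognizer output and the reference transcription, divided by the number of reference characters. The function names are hypothetical and this is not the BBN Byblos scoring code; the paper's exact scoring conventions, such as whitespace handling, may differ.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference characters
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(character_error_rate("abcd", "abxd"))  # 0.25
```

Under this definition, a character error rate of 2.1% corresponds to roughly 21 character-level edits per 1,000 reference characters.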