A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models

  • Authors: Gurpreet Singh Lehal
  • Affiliations: Punjabi University, Patiala, Punjab, India
  • Venue: Proceedings of the 4th International Workshop on Multilingual OCR
  • Year: 2013

Abstract

English words frequently occur in Gurmukhi texts, and a monolingual Gurmukhi OCR recognizes such words as garbage. It therefore becomes necessary to add bilingual capability to the Gurmukhi OCR so that it can recognize English text too. However, adding bilingual capability reduces recognition accuracy on monolingual texts because of errors in script identification: even a system with 99% script identification accuracy suffers a 1% reduction in recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts without any significant reduction in recognition accuracy, compared to the monolingual Gurmukhi OCR, when recognizing monolingual Gurmukhi text. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first such system to recognize, with high accuracy, document images containing mixed Gurmukhi and English text, or only Gurmukhi or only English text.
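
The arithmetic behind the 99% figure can be illustrated with a small sketch. It is not the paper's method: the function names are hypothetical, a misrouted word is assumed to be always misrecognized, and the multiple script identifiers are assumed to be combined by majority vote over independent errors, which the abstract does not state.

```python
from math import comb


def accuracy_on_monolingual_text(script_id_accuracy: float,
                                 recognizer_accuracy: float) -> float:
    """Expected word accuracy when a word is recognized correctly only if
    it is first routed to the correct script (simplifying assumption)."""
    return script_id_accuracy * recognizer_accuracy


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n identifiers is correct, assuming each
    is independently correct with probability p (n odd). Majority voting and
    independence are illustrative assumptions, not taken from the paper."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# A single 99% script identifier in front of a perfect recognizer
# caps accuracy at 99%, i.e. a loss of one percentage point.
single = accuracy_on_monolingual_text(0.99, 1.0)

# Three such identifiers under the majority-vote assumption.
combined = accuracy_on_monolingual_text(majority_vote_accuracy(0.99, 3), 1.0)

print(f"single identifier: {single:.6f}")
print(f"three identifiers: {combined:.6f}")
```

Under these assumptions, combining several identifiers shrinks the script identification error well below that of any single identifier, which is consistent with the abstract's claim that the loss on monolingual text can be made insignificant.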