An optical character recognition system for printed Telugu text

  • Authors:
  • Vasantha Lakshmi; C. Patvardhan

  • Affiliations:
  • Dayalbagh Educational Institute, Department of Physics and Computer Science, 282005, Agra, India; Dayalbagh Educational Institute, Department of Electrical Engineering, 282005, Agra, India

  • Venue:
  • Pattern Analysis & Applications
  • Year:
  • 2004

Abstract

Telugu is one of the oldest and most popular languages of India, spoken by more than 66 million people, especially in South India. Little work has been reported on the development of optical character recognition (OCR) systems for Telugu text, so it remains an area of current research. Some characters in Telugu are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. A compound character may contain one or more connected symbols. Therefore, systems developed for documents in other scripts, such as Roman, cannot be used directly for Telugu. The individual connected portions of a character or a compound character are defined as basic symbols in this paper and treated as the unit of recognition. The algorithms designed exploit special characteristics of Telugu script to process the document images efficiently. The algorithms have been implemented to create a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English, which can be transliterated to generate editable Telugu text. A special feature of TOSP is that it is designed to handle a large variety of sizes and multiple fonts while still providing raw OCR accuracy of nearly 98%. The phonetic English representation can also be used to develop a Telugu text-to-speech system; work is in progress in this regard.
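To illustrate what treating "individual connected portions" as units of recognition can mean in practice, the following is a minimal sketch of connected-component extraction on a binarized image. It is not the TOSP algorithm, which the abstract does not describe. The function name `connected_symbols`, the choice of 8-connectivity, and the toy image are all assumptions made for illustration.

```python
from collections import deque


def connected_symbols(image):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground regions of a binary image given as a list of rows (1 = ink).

    Hypothetical helper: a generic flood fill, not the paper's method.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all 8 neighbours (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy image with three separate ink regions (made-up data).
toy = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
boxes = connected_symbols(toy)
print(boxes)
```

Each bounding box would correspond to one candidate basic symbol. A single compound character could yield several boxes, which is why a recognizer built around whole characters in other scripts does not carry over directly.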