Choice of recognizable units for URDU OCR

  • Authors:
  • Gurpreet Singh Lehal

  • Affiliations:
  • Punjabi University, Patiala

  • Venue:
  • Proceeding of the workshop on Document Analysis and Recognition
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

There has been considerable work on Arabic OCR. However, all that work is based on Naskh style. Urdu script is based on Arabic alphabet, but uses Nastalique style. The Nastalique style makes OCR in general and character segmentation in particular, a highly challenging task, so most of the researchers avoid the character segmentation phase and go in for higher unit of recognition. For Urdu, the next higher recognition unit considered by researchers is ligature, which lies between character and word. A ligature is a connected component of one or more characters and usually an Urdu word is composed of 1 to 8 ligatures. A related issue is identification of all possible ligatures for recognition purpose. For this purpose, we have performed a statistical analysis of Urdu corpus to collect and organise the Urdu ligatures. The number of unique ligatures comes to be more than 26,000, and recognition of such a huge class is again a Herculean task. It becomes necessary to reduce the class count and look for alternative recognition unit. From OCR point of view, a ligature can further be segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary connected component corresponds to the dots and diacritics marks and special symbols associated with the ligature. To reduce the class count, the ligatures with similar primary components are clubbed together. Further statistical analysis is performed to count and arrange in descending order the primary components and a manageable class of around 2300 recognition units has been generated, which covers 99% of Urdu corpus.