Recognition of Nastalique Urdu ligatures

  • Authors:
  • Gurpreet Singh Lehal;Ankur Rana

  • Affiliations:
  • Punjabi University, Patiala, India;Punjabi University, Patiala, India

  • Venue:
  • Proceedings of the 4th International Workshop on Multilingual OCR
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

There has been considerable work on Arabic OCR. However, all that work is based on Naskh style. Urdu script is based on Arabic alphabet, but uses Nastalique style. The Nastalique style makes OCR in general and character segmentation in particular, a highly challenging task, so most of the researchers avoid the character segmentation phase and go in for higher unit of recognition. For Urdu, the next higher recognition unit considered by researchers is ligature, which lies between character and word. A ligature is a connected component of one or more characters and usually an Urdu word is composed of 1 to 8 ligatures. There are more than 25,000 Urdu ligatures, out of which top 4567 ligatures account for 99% of coverage. From OCR point of view, a ligature can further be segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary connected component corresponds to the dots and diacritics marks and special symbols associated with the ligature. To reduce the class count, the ligatures with similar primary components are clubbed together. In this paper, we have presented a system to recognize 9262 ligatures formed from 2190 primary and 17 secondary components. Various combinations of DCT, Gabor filters and zoning based features along with kNN, HMM and SVM classifiers have been tried and a recognition accuracy of 98% has been reported on pre-segmented ligatures.