Developing typewritten Arabic corpus with multi-fonts (TRACOM)

  • Authors:
  • Mohammed S. Khorsheed;Khaled M. Alhazmi;Adil M. Asiri

  • Affiliations:
  • King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia;King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia;King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

  • Venue:
  • Proceedings of the International Workshop on Multilingual OCR
  • Year:
  • 2009

Abstract

One of the obstacles that has delayed the development of character recognition systems for Arabic, compared with other languages such as Latin and Chinese, is the absence of supporting resources such as language corpora and electronic dictionaries. This paper aims to develop a diverse corpus of scanned page images with corresponding ground-truth text and description files. The corpus is a TypewRitten Arabic Corpus with Multi-fonts, referred to as TRACOM. TRACOM may also serve as a benchmark for assessing the performance of Arabic text recognition systems. The corpus includes data from the following sources: computer-generated documents, newspapers, magazines, and books. Each document image is paired with its equivalent text, i.e., the ground truth.