Synthetic Data for Arabic OCR System Development

  • Venue: ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
  • Year: 2001

Abstract

A system for the automatic generation of synthetic databases for the development or evaluation of Arabic word or text recognition systems (Arabic OCR) is presented. The proposed system requires no scanning of printed paper. First, Arabic text is typeset using a standard typesetting system. Second, a noise-free bitmap of the document and the corresponding ground truth (GT) are generated automatically. Finally, an image distortion can be superimposed on the character or word image to simulate the real-world noise expected in the intended application. All necessary modules are presented together with some examples. Special problems caused by specific features of Arabic, such as printing from right to left, the many diacritical points, variation in the height of characters, and changes in position relative to the writing line, are addressed. The synthetic data set was used to train and test a recognition system based on Hidden Markov Models (HMMs), originally developed for German cursive script, on Arabic printed words. Recognition results with different synthetic data sets are presented.
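
The final step of the pipeline, superimposing a distortion on a clean image while keeping its ground truth, can be illustrated with a minimal sketch. The abstract does not specify the distortion model, so the salt-and-pepper pixel flipping used here is only an assumed stand-in, and the function names, the toy bitmap, and the sample label are all hypothetical.

```python
import random


def add_salt_and_pepper(bitmap, flip_prob, seed=0):
    """Return a copy of a binary bitmap (rows of 0/1 pixels) in which each
    pixel is flipped independently with probability flip_prob.

    Assumption: salt-and-pepper noise is used as a simple example of a
    distortion; it is not the distortion model of the paper.
    """
    rng = random.Random(seed)  # seeded so that data sets are reproducible
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in bitmap
    ]


def make_samples(clean_bitmap, ground_truth, flip_probs, seed=0):
    """Build one (image, ground truth) record per noise level.

    The ground truth is taken from the typeset source text, so it stays
    exact no matter how strongly the image is degraded.
    """
    samples = []
    for i, prob in enumerate(flip_probs):
        noisy = add_salt_and_pepper(clean_bitmap, prob, seed=seed + i)
        samples.append(
            {"image": noisy, "ground_truth": ground_truth, "flip_prob": prob}
        )
    return samples


# Toy 5x8 bitmap standing in for a rendered, noise-free word image (1 = ink).
clean = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Hypothetical label; a real generator would take it from the typeset text.
samples = make_samples(clean, "كتاب", flip_probs=[0.0, 0.05, 0.2])
```

Generating several noise levels from one clean bitmap is one way to obtain the "different synthetic data sets" mentioned above without any manual labelling, since every degraded image inherits the same ground truth.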