Synthetic Data for Arabic OCR System Development

Authors:
Affiliations:
Venue: ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Year: 2001

Abstract

A system is presented for automatically generating synthetic databases for developing or evaluating Arabic word or text recognition systems (Arabic OCR). The proposed system requires no scanning of printed paper. First, Arabic text is typeset using a standard typesetting system. Second, a noise-free bitmap of the document and the corresponding ground truth (GT) are generated automatically. Finally, an image distortion can be superimposed on the character or word image to simulate the real-world noise expected in the intended application. All necessary modules are presented together with some examples. Special problems caused by specific features of Arabic, such as printing from right to left, many diacritical points, variation in the height of characters, and changes in position relative to the writing line, are discussed. The synthetic data set was used to train and test, on printed Arabic words, a recognition system based on Hidden Markov Models (HMMs) that was originally developed for German cursive script. Recognition results with different synthetic data sets are presented.
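The final step of the pipeline, superimposing a distortion on a noise-free bitmap while the ground truth stays attached, can be illustrated with a minimal sketch. The abstract does not specify the distortion model, so the random pixel-flip (salt-and-pepper) noise used here is an assumption chosen for simplicity, and the names `add_pixel_noise`, `make_sample`, `flip_prob`, and the example label are hypothetical.

```python
import random


def add_pixel_noise(bitmap, flip_prob, seed=0):
    """Return a copy of a binary bitmap with each pixel flipped
    independently with probability flip_prob.

    Assumption: salt-and-pepper noise stands in for whatever
    distortion model a real application would call for.
    """
    rng = random.Random(seed)  # seeded so the data set is reproducible
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in bitmap
    ]


def make_sample(clean_bitmap, ground_truth, flip_prob, seed=0):
    """Pair a distorted image with the unchanged ground-truth label."""
    return {
        "image": add_pixel_noise(clean_bitmap, flip_prob, seed),
        "gt": ground_truth,
    }


# Tiny hypothetical "word image": 1 = ink, 0 = background.
clean = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# The label is an illustrative placeholder, not data from the paper.
sample = make_sample(clean, "كتاب", flip_prob=0.1, seed=42)
```

Because the label is carried alongside the image and never derived from it, the ground truth remains exact however strong the distortion is. Varying `flip_prob` and `seed` yields several data sets of differing noise levels from one clean rendering.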