A Database for Arabic Printed Character Recognition

Authors:
Ashraf Abdelraouf; Colin A. Higgins; Mahmoud Khalil
Affiliations:
School of Computer Science, The University of Nottingham, Nottingham, UK and Faculty of Computer Science, Misr International University, Cairo, Egypt; School of Computer Science, The University of Nottingham, Nottingham, UK; Faculty of Engineering, Ain Shams University, Cairo, Egypt
Venue:
ICIAR '08 Proceedings of the 5th international conference on Image Analysis and Recognition
Year:
2008



Abstract

Electronic Document Management (EDM) technology is being widely adopted because it enables efficient routing and retrieval of documents. Optical Character Recognition (OCR) is an important front end for such technology. Excellent OCR now exists for Latin-based languages, but few systems read Arabic, which limits the penetration of EDM into Arabic-speaking countries. Developing an OCR system for Arabic requires a database of Arabic words. Such a database has many uses beyond training and testing a recognition system. This paper provides a comprehensive study and analysis of Arabic words and explains how such a database was constructed. Unlike earlier studies, it describes a database built from a large collection of Arabic words (6 million). It also considers connected segments, or Pieces of Arabic Words (PAWs), as well as Naked Pieces of Arabic Words (NPAWs), which are PAWs without diacritics. Background information on the Arabic language is also presented.
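To make the PAW and NPAW notions concrete, the following is a minimal sketch of how a word might be split into connected pieces. It is not the authors' procedure. It assumes a hand-written set of letters that never join to the following letter, covers only the basic Arabic block, and reads "diacritics" as Unicode combining marks (tashkeel such as fatha or shadda). If the paper instead means the dots that distinguish letters, a letter-to-skeleton mapping would be needed. The names `split_paws`, `strip_marks`, and `split_npaws` are hypothetical.

```python
import unicodedata

# Assumption: letters that join to the previous letter but never to the next,
# so a connected piece always ends after them. In order: alef, alef with hamza
# above, alef with hamza below, alef with madda, dal, thal, reh, zain, waw,
# waw with hamza, teh marbuta, alef maksura.
RIGHT_JOINING = set(
    "\u0627\u0623\u0625\u0622\u062f\u0630\u0631\u0632\u0648\u0624\u0629\u0649"
)
# Assumption: the standalone hamza joins to neither side.
NON_JOINING = {"\u0621"}


def strip_marks(text):
    """Remove Unicode combining marks (category Mn), e.g. fatha or shadda."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def split_paws(word):
    """Split a single word into its connected pieces (PAWs)."""
    paws, current, break_pending = [], "", False
    for ch in word:
        is_mark = unicodedata.category(ch) == "Mn"
        # Close the current piece before a new base letter if the previous
        # letter cannot join forward, or this letter cannot join backward.
        if not is_mark and current and (break_pending or ch in NON_JOINING):
            paws.append(current)
            current = ""
        current += ch  # combining marks stay attached to their base letter
        if not is_mark:
            break_pending = ch in RIGHT_JOINING or ch in NON_JOINING
    if current:
        paws.append(current)
    return paws


def split_npaws(word):
    """PAWs with combining marks removed (one reading of 'naked' PAWs)."""
    return [strip_marks(paw) for paw in split_paws(word)]


if __name__ == "__main__":
    # The word "madrasa" (school): meem, dal, reh, seen, teh marbuta.
    print(split_paws("\u0645\u062f\u0631\u0633\u0629"))
```

Under these assumptions, a word made only of letters that join on both sides yields a single PAW, while each dal, reh, waw, or alef inside a word closes the piece it belongs to.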