Arabic OCR error correction using character segment correction, language modeling, and shallow morphology

Authors: Walid Magdy; Kareem Darwish
Affiliation (both authors): IBM Technology Development Center, El-Ahram, Giza, Egypt
Venue: EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Year: 2006

Abstract

This paper explores the use of a character-segment-based correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experiments show that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology has a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.
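
To make the idea of ranking candidate corrections concrete, the following is a minimal sketch, not the authors' implementation. It assumes a noisy-channel style score that adds a unigram language-model prior to a character-segment error model allowing segments of up to two characters. All names (`SEGMENT_PROBS`, `WORD_COUNTS`, `channel_log_prob`, `rank_candidates`) and every probability are hypothetical, and Latin letters stand in for Arabic text to keep the toy example readable.

```python
import math

# Toy, made-up numbers for illustration only (not taken from the paper).
# Keys are (true_segment, observed_segment); values approximate P(observed | true).
SEGMENT_PROBS = {
    ("m", "rn"): 0.2,  # one true character read as a two-character segment
    ("m", "m"): 0.8,
}
WORD_COUNTS = {"model": 50, "modal": 5, "nodes": 20}  # stand-in for corpus counts


def channel_log_prob(observed, candidate, seg_probs, max_seg=2,
                     match_prob=0.9, floor=1e-6):
    """Log-probability of the best segment alignment of observed given candidate."""
    n, m = len(observed), len(candidate)
    neg_inf = float("-inf")
    # best[i][j]: best score aligning observed[:i] with candidate[:j]
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            for di in range(max_seg + 1):
                for dj in range(max_seg + 1):
                    if (di == 0 and dj == 0) or i + di > n or j + dj > m:
                        continue
                    obs_seg = observed[i:i + di]
                    true_seg = candidate[j:j + dj]
                    p = seg_probs.get((true_seg, obs_seg))
                    if p is None:
                        # Unseen pair: identical single characters are likely,
                        # anything else gets a small smoothing floor.
                        p = match_prob if (di == 1 and obs_seg == true_seg) else floor
                    score = best[i][j] + math.log(p)
                    if score > best[i + di][j + dj]:
                        best[i + di][j + dj] = score
    return best[n][m]


def rank_candidates(observed, word_counts, seg_probs):
    """Sort dictionary words by log prior plus channel log-probability."""
    total = sum(word_counts.values())
    scored = []
    for word, count in word_counts.items():
        prior = math.log(count / total)  # unigram language model
        scored.append((word, prior + channel_log_prob(observed, word, seg_probs)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates("rnodel", WORD_COUNTS, SEGMENT_PROBS)
for word, score in ranked:
    print(word, round(score, 2))
```

In this toy setup a single-character error model could not explain "rn" as a misreading of "m" in one step, which is the kind of case a segment-based model is meant to cover. A real system would estimate the segment table from aligned OCR and clean text and would likely use a higher-order language model.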