Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

Authors:
Walid Magdy;Kareem Darwish
Affiliations:
IBM Technology Development Center, El-Ahram, Giza, Egypt;IBM Technology Development Center, El-Ahram, Giza, Egypt
Venue:
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Year:
2006

Abstract

Arabic documents that are available only in print remain ubiquitous; they can be scanned and then OCR’ed to make them easier to retrieve. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR’ed documents using different index terms. The OCR correction uses an improved character-segment-based noisy channel model and is tested on both real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
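
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified noisy channel in which the channel probability is approximated by a fixed log-penalty per character edit (the paper describes a character-segment-based model, which is richer), and it shows how character n-grams can be produced as index terms. The function names, the `error_log_prob` parameter, and the transliterated example words are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n):
    """Return overlapping character n-grams; a word shorter than n is kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def edit_distance(a, b):
    """Levenshtein distance using a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def correct(observed, word_counts, error_log_prob=-4.0):
    """Pick the vocabulary word maximizing log P(word) + log P(observed | word).

    Assumption: log P(observed | word) is approximated as
    error_log_prob * edit_distance, a stand-in for a trained channel model.
    """
    total = sum(word_counts.values())

    def score(candidate):
        prior = math.log(word_counts[candidate] / total)
        channel = error_log_prob * edit_distance(observed, candidate)
        return prior + channel

    return max(word_counts, key=score)


# Hypothetical toy vocabulary (transliterated words with made-up counts).
vocab = Counter({"kitab": 5, "katib": 3, "maktab": 2})
corrected = correct("kitxb", vocab)
trigrams = char_ngrams("kitab", 3)
```

In this toy setting the degraded token `kitxb` is one edit away from `kitab` and two edits away from `katib`, so the corrector selects `kitab`. Indexing by trigrams instead would still match two of the three trigrams of `kitab` against a document containing `kitxb`-like noise only where the damaged character is not involved, which illustrates why short n-grams can tolerate OCR errors without explicit correction.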