Error correction vs. query garbling for Arabic OCR document retrieval

Authors:
Kareem Darwish; Walid Magdy
Affiliations:
IBM Technology Development Center, Cairo, Abou Rawash, Egypt; IBM Technology Development Center, Cairo, Abou Rawash, Egypt
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2007

Abstract

Because large numbers of legacy documents (such as old books and newspapers) still exist, improving retrieval effectiveness for OCR'ed documents remains an important problem. This article compares two approaches to retrieving OCR-degraded Arabic documents: OCR error correction, with and without language modeling, and query garbling with weighted structured queries. The results suggest that moderate error correction does not yield a statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. In contrast, reversing error correction models to perform query garbling, in conjunction with weighted structured queries, improves retrieval effectiveness. Lastly, very good error correction that uses language modeling yields the largest improvement in retrieval effectiveness.
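
To make the idea of query garbling concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical single-character confusion model (`CONFUSIONS`, with made-up probabilities and Latin letters standing in for Arabic characters) and expands a query term into its most likely OCR-degraded variants. The normalized probabilities could serve as weights in a weighted structured query. The helper `char_ngrams` illustrates character n-gram index terms; the function names, the `top_k` cutoff, and n=3 are assumptions for illustration only.

```python
from itertools import product

# Hypothetical confusion model: P(OCR output | true character).
# Values are invented for illustration and do not come from the article.
CONFUSIONS = {
    "a": [("a", 0.9), ("o", 0.1)],
    "l": [("l", 0.8), ("1", 0.2)],
}


def garble(term, confusions, top_k=5):
    """Return up to top_k (variant, weight) pairs; weights sum to 1."""
    # Characters missing from the model are assumed to be recognized correctly.
    options = [confusions.get(ch, [(ch, 1.0)]) for ch in term]
    scored = {}
    for combo in product(*options):
        variant = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        scored[variant] = scored.get(variant, 0.0) + prob
    # Sort by descending probability, breaking ties alphabetically.
    best = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    total = sum(p for _, p in best)
    return [(variant, p / total) for variant, p in best]


def char_ngrams(term, n=3):
    """Character n-grams of a term; short terms are returned whole."""
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


if __name__ == "__main__":
    for variant, weight in garble("al", CONFUSIONS):
        print(variant, round(weight, 2))
    print(char_ngrams("kitab"))
```

This sketch enumerates every combination of character substitutions, which is only practical for short terms and small confusion sets. A realistic error model would also need to handle insertions, deletions, and multi-character segments.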