Effect of OCR error correction on Arabic retrieval

Authors:
Walid Magdy;Kareem Darwish
Affiliations:
Cairo Microsoft Innovation Center, Abou Rawash, Egypt;Cairo Microsoft Innovation Center, Abou Rawash, Egypt
Venue:
Information Retrieval
Year:
2008

Abstract

Arabic documents that are available only in print remain ubiquitous; they can be scanned and then OCR'ed to make them easier to retrieve. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Several OCR correction techniques based on language modeling, with differing correction abilities, were tested on both real and synthetic OCR degradation. Results show that the reduction in word error rate must exceed a certain threshold before it has a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-grams for retrieval without error correction is a reasonable strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
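To make two of the abstract's terms concrete, here is a minimal Python sketch of character n-gram index terms and of word error rate. It is illustrative only and is not the authors' implementation: the function names `char_ngrams` and `word_error_rate`, the sample word, and the simulated OCR confusion (one letter swapped for a similarly shaped one) are all assumptions made for this example.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0
    # Standard dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a clean word and a version with one misrecognized letter.
clean = "المكتبة"
noisy = "المكنبة"

# As whole words the two strings do not match, but some 3-grams survive the error.
shared = set(char_ngrams(clean)) & set(char_ngrams(noisy))
print(len(char_ngrams(clean)), "3-grams in the clean word;", len(shared), "shared with the noisy one")
```

The sketch shows why short character n-grams can tolerate OCR errors: a single misrecognized letter breaks a whole-word match, yet the n-grams that do not cover the damaged position still match.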