Arabic documents that are available only in print remain ubiquitous, and they can be scanned and OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Several OCR correction techniques based on language modeling, each with a different correction ability, were tested on both real OCR output and synthetic OCR degradation. Results show that the reduction in word error rate must exceed a certain threshold before it has a noticeable effect on retrieval. If only moderate error reduction is available, then retrieving with short character n-grams and no error correction is a reasonable strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
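The two quantities the abstract relies on, word error rate and character n-gram index terms, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`char_ngrams`, `ngram_index_terms`, `word_error_rate`) are hypothetical, tokenization is plain whitespace splitting, and a Latin-script word stands in for Arabic text purely for readability.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_index_terms(text: str, n: int = 3) -> list[str]:
    """Turn whitespace-separated text into a flat list of n-gram index terms."""
    terms: list[str] = []
    for word in text.split():
        terms.extend(char_ngrams(word, n))
    return terms


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# A single-character OCR error breaks the whole-word match
# but leaves several of the word's 3-grams intact.
clean = set(char_ngrams("retrieval"))
noisy = set(char_ngrams("retrleval"))
shared = clean & noisy
```

In this toy case the misrecognized word shares four of its seven 3-grams with the clean word, which gives some intuition for why short character n-grams tolerate moderate OCR error even when no correction is applied.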