Error correction vs. query garbling for Arabic OCR document retrieval
ACM Transactions on Information Systems (TOIS)
Many Arabic documents remain available only in print; they can be scanned and then OCR’ed to make them easier to retrieve. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving OCR’ed Arabic documents using different index terms. The OCR correction uses an improved character-segment-based noisy channel model and is tested on both real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing with short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
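To illustrate why the length of the index term matters, the following minimal sketch compares how many index terms survive a single OCR character error when a word is indexed whole versus as overlapping character n-grams. It is not the paper's implementation: the helper names `char_ngrams` and `shared_fraction`, the English example word, and the simulated error (`l` misread as `1`) are all assumptions made for illustration.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole, so a large n
    behaves like word-based indexing.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_fraction(clean, degraded, n):
    """Fraction of the clean word's n-gram index terms that also
    appear among the index terms of the OCR-degraded word."""
    clean_terms = set(char_ngrams(clean, n))
    degraded_terms = set(char_ngrams(degraded, n))
    return len(clean_terms & degraded_terms) / len(clean_terms)


# Hypothetical example: one character substitution introduced by OCR.
clean_word = "retrieval"
ocr_word = "retrieva1"

trigram_overlap = shared_fraction(clean_word, ocr_word, 3)  # 6 of 7 trigrams survive
word_overlap = shared_fraction(clean_word, ocr_word, 9)     # the whole word no longer matches
```

In this toy case a query on the clean word still matches most trigrams of the degraded word, while the whole-word index term matches nothing unless the error is corrected first. This is consistent with the abstract's observation that the benefit of correction depends on index term length, though the sketch says nothing about actual retrieval effectiveness.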