Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
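To make the two kinds of indexing terms concrete, here is a minimal sketch in Python of how character n-grams and lightly stemmed words might be extracted from a token and combined into one term list. It is an illustration under stated assumptions, not the method evaluated in the paper: the function names, the `stem:`/`gram:` tags, the choice of n = 3, and especially the short prefix and suffix lists are hypothetical, and a practical light stemmer would use larger, carefully tuned affix lists.

```python
# Illustrative affix lists only (assumption): not the lists used in the paper.
# Prefixes are ordered longest first so the longest match is stripped.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ة", "ي")


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def index_terms(text, n=3):
    """Combine both term types: one light stem plus the character n-grams per token.

    The n-grams are taken from the unstemmed token here; that is a design
    choice of this sketch, not something the abstract specifies.
    """
    terms = []
    for token in text.split():
        terms.append("stem:" + light_stem(token))
        terms.extend("gram:" + gram for gram in char_ngrams(token, n))
    return terms
```

Tagging each term with its type keeps the two vocabularies separate in a single index, so a query can be matched against either or both. N-grams can still match when OCR corrupts part of a word, while the stem conflates simple inflected forms of a correctly recognized word.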