Term selection for searching printed Arabic

Authors:
Kareem Darwish; Douglas W. Oard
Affiliations:
University of Maryland, College Park, College Park, MD; University of Maryland, College Park, College Park, MD
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Abstract

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
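
The two kinds of index terms named in the abstract, character n-grams and lightly stemmed words, can be illustrated with a minimal sketch. This is not the authors' implementation: the affix lists, the minimum stem length, the function names, and the `stem:`/`gram:` tags are all assumptions chosen for illustration, and a real light stemmer would use a more carefully chosen affix inventory.

```python
# Illustrative sketch only. The affix lists are assumptions made for this
# example; they are NOT the stemmer used in the paper.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def char_ngrams(word, n):
    """Return all overlapping character n-grams of a word.

    A word no longer than n characters is returned unchanged as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix.

    An affix is removed only if at least min_len characters remain.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def index_terms(text, n=3):
    """Combine both term types: one light stem plus the n-grams of each word."""
    terms = []
    for word in text.split():
        terms.append("stem:" + light_stem(word))
        terms.extend("gram:" + gram for gram in char_ngrams(word, n))
    return terms
```

Tagging each term with its type keeps stems and n-grams from colliding in a shared index. Whether to index them together or in separate fields is a design choice this sketch does not settle.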