Capturing out-of-vocabulary words in Arabic text

Authors:
Abdusalam F. A. Nwesri;S. M. M. Tahaghoghi;Falk Scholer
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Year:
2006

Citing 6

Finding approximate matches in large lexicons
Software—Practice & Experience

Effective foreign word extraction for Korean information retrieval
Information Processing and Management: an International Journal

On Arabic search: improving the retrieval effectiveness via a light stemming approach
Proceedings of the eleventh international conference on Information and knowledge management

Statistical transliteration for English-Arabic cross language information retrieval
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

Arabic Stemming Without A Root Dictionary
ITCC '05 Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume I - Volume 01

Translating names and technical terms in Arabic text
Semitic '98 Proceedings of the Workshop on Computational Approaches to Semitic Languages

Cited 5

A Method for Recognizing Noisy Romanized Japanese Words in Learner English
IEICE - Transactions on Information and Systems

Recognizing noisy romanized Japanese words in learner English
EANL '08 Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications

Finding variants of out-of-vocabulary words in Arabic
Semitic '07 Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources

Identification of transliterated foreign words in Hebrew script
CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing

The Effect of Stemming on Arabic Text Classification: An Empirical Study
International Journal of Information Retrieval Research

Abstract

The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, so foreign words need to be identified before any stemming takes place. In this paper, we investigate three approaches for the identification of foreign words in Arabic text: lexicons, language patterns, and n-grams. Our results show that lexicon-based approaches outperform the other techniques.
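
To make the lexicon idea concrete, here is a minimal Python sketch of one way a lexicon lookup could flag foreign words before stemming. It is not the authors' implementation, since the abstract does not describe their procedure. The tiny lexicon, the prefix list, and the function names (`candidate_forms`, `is_foreign`, `find_foreign_words`) are all assumptions made for illustration.

```python
# Toy lexicon of native Arabic words (hypothetical; a real lexicon
# would contain many thousands of entries).
NATIVE_LEXICON = {"كتاب", "مدرسة", "بيت", "قلم"}

# A few common prefixes (definite article, conjunction, preposition),
# assumed here for illustration only. Longer prefixes come first.
PREFIXES = ("وال", "بال", "ال", "و")


def candidate_forms(word):
    """Yield the word itself, then the word with each matching prefix removed."""
    yield word
    for prefix in PREFIXES:
        # Only strip a prefix if at least two characters remain.
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            yield word[len(prefix):]


def is_foreign(word, lexicon=NATIVE_LEXICON):
    """Flag a word as foreign when none of its candidate forms is in the lexicon."""
    return not any(form in lexicon for form in candidate_forms(word))


def find_foreign_words(tokens, lexicon=NATIVE_LEXICON):
    """Return the tokens that the lexicon lookup cannot account for, in order."""
    return [token for token in tokens if is_foreign(token, lexicon)]


# Two native words with prefixes, one bare native word, and two
# transliterated loan words ("computer" and "internet").
tokens = ["الكتاب", "كمبيوتر", "والمدرسة", "انترنت", "بيت"]
foreign = find_foreign_words(tokens)
print(foreign)
```

In this sketch the flagged tokens would be excluded from stemming, while the remaining tokens would be passed to the stemmer as usual. A lookup this simple depends entirely on lexicon coverage: any native word missing from the lexicon is wrongly flagged as foreign.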