Unsupervised learning of Arabic stemming using a parallel corpus

Authors:
Monica Rogati;Scott McCarley;Yiming Yang
Affiliations:
Carnegie Mellon University;IBM TJ Watson;Carnegie Mellon University
Venue:
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Year:
2003

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation, and its sole training resources are an English stemmer and a small parallel corpus (10K sentences). No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results are given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach achieves 87.5% agreement with a state-of-the-art proprietary Arabic stemmer that was built using rules, affix lists, and human-annotated text in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer.
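To make the idea of learning affix removal from a parallel corpus more concrete, here is a minimal toy sketch. It is not the paper's model: it swaps in a simple Dice co-occurrence score for the statistical translation model, and every function name (`candidate_splits`, `train`, `best_split`), every length limit, and the tiny Latin-transliterated corpus are assumptions invented for illustration. The only idea taken from the abstract is that candidate stems of a foreign word are judged by how consistently they line up with already-stemmed English words in aligned sentences.

```python
from collections import Counter

def candidate_splits(word, max_prefix=2, max_suffix=2, min_stem=3):
    """Yield every (prefix, stem, suffix) split within the assumed length limits."""
    for p in range(max_prefix + 1):
        for s in range(max_suffix + 1):
            if len(word) - p - s >= min_stem:
                end = len(word) - s
                yield word[:p], word[p:end], word[end:]

def train(parallel_corpus):
    """Count sentence-level co-occurrences of candidate stems and English stems."""
    stem_count, eng_count, cooc = Counter(), Counter(), Counter()
    for foreign_tokens, english_stems in parallel_corpus:
        eng = set(english_stems)
        stems = {stem for w in foreign_tokens for _, stem, _ in candidate_splits(w)}
        eng_count.update(eng)
        stem_count.update(stems)
        cooc.update((stem, e) for stem in stems for e in eng)
    return stem_count, eng_count, cooc

def score(stem, model):
    """Best Dice coefficient between a candidate stem and any English stem."""
    stem_count, eng_count, cooc = model
    if stem_count[stem] == 0:
        return 0.0
    return max(
        2 * cooc[(stem, e)] / (stem_count[stem] + eng_count[e]) for e in eng_count
    )

def best_split(word, model):
    """Pick the split whose stem is most strongly tied to one English stem."""
    return max(
        candidate_splits(word),
        key=lambda split: score(split[1], model),
        default=("", word, ""),  # word too short to split: leave it unchanged
    )

# Hypothetical toy data: transliterated word forms paired with English stems.
toy_corpus = [
    (["alktab"], ["the", "book"]),
    (["ktab"], ["book"]),
    (["ktabha"], ["her", "book"]),
    (["alwld"], ["the", "boy"]),
    (["wld"], ["boy"]),
    (["wldha"], ["her", "boy"]),
]

model = train(toy_corpus)
for w in ["alktab", "ktabha", "alwld"]:
    print(w, best_split(w, model))
```

Once the counts are collected, `best_split` needs only the word itself, which loosely mirrors the abstract's point that no parallel text is required after training. A real system would replace the Dice heuristic with a proper translation model and would need smoothing for stems never seen in training.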