A new approach to lexical disambiguation of Arabic text

Authors:
Rushin Shah;Paramveer S. Dhillon;Mark Liberman;Dean Foster;Mohamed Maamouri;Lyle Ungar
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Citing 12
Cited 1

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Morphological tagging: data vs. dictionaries

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
An unsupervised morpheme-based HMM for hebrew morphological disambiguation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Context-based morphological disambiguation with random fields

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Online Passive-Aggressive Algorithms

The Journal of Machine Learning Research
Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Automatic tagging of Arabic text: from raw text to base phrase chunks

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Unsupervised concept discovery in Hebrew using simple unsupervised word prefix segmentation for Hebrew and Arabic

Semitic '09 Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages
A global model for joint lemmatization and part-of-speech prediction

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Simultaneous tokenization and part-of-speech tagging for Arabic without a morphological analyzer

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers

Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging

ACM Transactions on Asian Language Information Processing (TALIP)

Quantified Score

Hi-index	0.01

Visualization

Abstract

We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators.