Morphology-Based Segmentation Combination for Arabic Mention Detection

Authors:
Yassine Benajiba; Imed Zitouni
Affiliations:
Center for Computational Learning Systems, Columbia University; IBM T. J. Watson Research Center
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2009

Abstract

Arabic has a rich and complex morphology: each word is composed of zero or more prefixes, one stem, and zero or more suffixes. Consequently, Arabic data is sparse compared to languages such as English, and word segmentation is necessary before any natural language processing task. Because segmentation is a preprocessing step that significantly affects every step that follows, it merits deeper study. In this article, we present an Arabic mention detection system that achieved very competitive results in the recent Automatic Content Extraction (ACE) evaluation campaign. We investigate the impact of different segmentation schemes on Arabic mention detection systems and show how these systems can benefit from more than one segmentation scheme. We report the performance of several mention detection models using different known segmentation schemes for Arabic text: punctuation separation, Arabic Treebank, morphological, and character-level segmentation. We show that combining competitive segmentation styles leads to better performance. Results indicate a statistically significant improvement when Arabic Treebank and morphological segmentations are combined.
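
To make the segmentation schemes named in the abstract concrete, the following is a minimal sketch of three of them applied to toy input. It is not the system described in the article: the affix lists, the function names, and the greedy stripping rule are all hypothetical, chosen only to illustrate the "prefixes + stem + suffixes" structure. Words are written in a Buckwalter-style transliteration (an assumption made to keep the example ASCII-only), and Arabic Treebank segmentation is omitted because it cannot be reproduced with a few rules.

```python
import re

# Hypothetical toy affix inventories (Buckwalter-style transliteration).
# Real segmenters use much larger inventories and statistical disambiguation.
PREFIXES = ["w", "Al", "b", "l"]   # e.g. w+ "and", Al+ "the"
SUFFIXES = ["hm", "hA", "h"]       # e.g. +hm "their"


def punctuation_separation(text):
    """Split text into words and detach punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def character_level(word):
    """Treat every character of the word as its own segment."""
    return list(word)


def morphological(word, min_stem=2):
    """Greedily strip known prefixes, then at most one known suffix.

    This is a naive rule-based illustration, not a real morphological
    analyzer: it never strips an affix if fewer than `min_stem`
    characters would remain as the stem.
    """
    prefixes, stem = [], word
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            if stem.startswith(p) and len(stem) - len(p) >= min_stem:
                prefixes.append(p + "+")
                stem = stem[len(p):]
                stripped = True
                break
    suffixes = []
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            suffixes.append("+" + s)
            stem = stem[: -len(s)]
            break
    return prefixes + [stem] + suffixes


if __name__ == "__main__":
    print(punctuation_separation("ktAb, qlm."))  # ['ktAb', ',', 'qlm', '.']
    print(morphological("wAlktAb"))              # ['w+', 'Al+', 'ktAb']
    print(morphological("ktAbhm"))               # ['ktAb', '+hm']
    print(character_level("ktAb"))               # ['k', 't', 'A', 'b']
```

The three functions produce progressively finer units for the same text, which is what creates the trade-off between data sparseness and context per token that the article investigates.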