Smoothing a lexicon-based POS tagger for Arabic and Hebrew

Authors:
Saib Mansour;Khalil Sima'an;Yoad Winter
Affiliations:
Computer Science, Haifa, Israel;Universiteit van Amsterdam, Amsterdam, The Netherlands;Computer Science, Haifa, Israel and Netherlands Institute for Advanced Study, Wassenaar, The Netherlands
Venue:
Semitic '07 Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources
Year:
2007

Citing 10
Cited 3

Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging

Computational Linguistics
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Coping with ambiguity and unknown words through probabilistic models

Computational Linguistics - Special issue on using large corpora: II
A stochastic parts program and noun phrase parser for unrestricted text

ANLC '88 Proceedings of the second conference on Applied natural language processing
Language model based arabic word segmentation

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Chinese and Japanese word segmentation using word-level and character-level information

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Automatic tagging of Arabic text: from raw text to base phrase chunks

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Choosing an optimal architecture for segmentation and POS-tagging of modern Hebrew

Semitic '05 Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Developing an Arabic treebank: methods, guidelines, procedures, and tools

Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages

A probabilistic morphological analyzer for Syriac

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
An efficient part-of-speech tagger for arabic

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I
A comparison of segmentation methods and extended lexicon models for Arabic statistical machine translation

Machine Translation

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose an enhanced Part-of-Speech (POS) tagger of Semitic languages that treats Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) using the same probabilistic model and architectural setting. We start out by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic by exchanging a morphological analyzer for Hebrew with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (96.12%), comparable to Habash and Rambow's (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzer (Bar Haim et al., 2005). To overcome this coverage problem we supplement the output of Buckwalter's analyzer with synthetically constructed analyses that are proposed by a model which uses character information (Diab et al., 2004) in a way that is similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that (unlike Nakagawa) incorporates synthetically constructed analyses also for known words achieves 96.28% accuracy on the standard Arabic test set.