Choosing an optimal architecture for segmentation and POS-tagging of modern Hebrew

Authors:
Roy Bar-Haim;Khalil Sima'an;Yoad Winter
Affiliations:
Bar-Ilan University, Ramat-Gan, Israel;Universiteit van Amsterdam, Amsterdam, The Netherlands;Haifa, Israel
Venue:
Semitic '05 Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Year:
2005

Abstract

A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger for Hebrew based on Hidden Markov Models (HMMs). We start from a morphological analyzer and a very small morphologically annotated corpus. We show that a model whose terminal symbols are word segments (morphemes) has an advantage over a word-level model for the task of POS tagging. For segmentation alone, however, the morpheme-level model has no significant advantage over the word-level model. Error analysis shows that neither model is adequate for resolving a common type of segmentation ambiguity in Hebrew: whether or not a word in a written text is prefixed by a definiteness marker. Hence, we propose a morpheme-level model in which the definiteness morpheme is treated as a possible feature of morpheme terminals. This model exhibits the best overall performance, both in POS tagging and in segmentation. Despite the small size of the annotated corpus available for Hebrew, the results achieved using our best model are on par with recent results on Modern Standard Arabic.
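
To make the idea of morpheme-level terminals concrete, the following is a minimal sketch, not the authors' implementation. It assumes a bigram HMM over morpheme tags and a hypothetical morphological analyzer that has already produced, for each word, a list of candidate analyses, each a sequence of (segment, tag) pairs. The function name, the tag names, the transliterated forms, and all probabilities are illustrative assumptions. Choosing the best analysis for every word yields a segmentation and a POS tagging at the same time.

```python
import math


def viterbi_morpheme_lattice(candidates, trans, emit, floor=1e-6):
    """Pick one analysis per word under a bigram HMM over morpheme tags.

    candidates: one entry per word; each entry is a list of analyses, and
                each analysis is a list of (segment, tag) pairs.
    trans:      dict mapping (previous_tag, tag) to a probability.
    emit:       dict mapping (tag, segment) to a probability.
    floor:      probability used for unseen events (a crude assumption,
                standing in for proper smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # For a bigram model only the last tag matters for the future, so we
    # keep the best-scoring path for each possible last tag.
    best_by_tag = {"<s>": (0.0, [])}
    for analyses in candidates:
        next_best = {}
        for analysis in analyses:
            for prev_tag, (prev_score, path) in best_by_tag.items():
                score, last_tag = prev_score, prev_tag
                # Terminals are morphemes: each segment contributes one
                # transition and one emission term.
                for segment, tag in analysis:
                    score += logp(trans, (last_tag, tag))
                    score += logp(emit, (tag, segment))
                    last_tag = tag
                if last_tag not in next_best or score > next_best[last_tag][0]:
                    next_best[last_tag] = (score, path + [analysis])
        best_by_tag = next_best
    return max(best_by_tag.values(), key=lambda item: item[0])[1]


# Toy data (made-up numbers): one word with two candidate analyses,
# either a definiteness prefix plus a noun, or a single unsegmented noun.
toy_candidates = [
    [
        [("h", "DEF"), ("qph", "NOUN")],
        [("hqph", "NOUN")],
    ]
]
toy_trans = {("<s>", "DEF"): 0.4, ("DEF", "NOUN"): 0.9, ("<s>", "NOUN"): 0.3}
toy_emit = {("DEF", "h"): 1.0, ("NOUN", "qph"): 0.05, ("NOUN", "hqph"): 0.01}

result = viterbi_morpheme_lattice(toy_candidates, toy_trans, toy_emit)
print(result)
```

A word-level model would instead treat each whole word, paired with a composite tag, as a single terminal. The sketch does not include the proposed variant in which definiteness is a feature of a morpheme terminal rather than a separate segment.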