Part-of-speech tagging of Modern Hebrew text

Authors:
Roy Bar-Haim; Khalil Sima'an; Yoad Winter
Affiliations:
Dept. of Computer Science, Bar-Ilan University, Ramat-Gan 52900, Israel, e-mail: barhair@cs.biu.ac.il; Institute for Logic, Language and Computation, Universiteit van Amsterdam, Amsterdam, The Netherlands, e-mail: simaan@science.uva.nl; Dept. of Computer Science, Technion, Haifa 32000, Israel, e-mail: winter@cs.technion.ac.il, and Netherlands Institute for Advanced Study, Meijboomlaan 1, 2242 PR Wassenaar, The Netherlands
Venue:
Natural Language Engineering
Year:
2008

Abstract

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that performs better on Hebrew than the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation.
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
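
As a rough illustration of the two tokenization architectures contrasted in the abstract, the sketch below shows how one ambiguous word might be presented to a tagger in each case. It is a toy example only: the transliterated word, its three analyses, the tag names, and the helper functions `word_level_tags` and `segment_level_inputs` are all assumptions made for illustration and are not taken from the paper's analyzer or tagset.

```python
from typing import Dict, List, Tuple

# One analysis = a sequence of (segment, POS tag) pairs.
Analysis = List[Tuple[str, str]]

# Hypothetical analyzer output for one transliterated word
# (toy data and toy tag names, not the paper's lexicon or tagset).
TOY_LEXICON: Dict[str, List[Analysis]] = {
    "bclm": [
        [("b", "PREP"), ("cl", "NOUN"), ("m", "PRON")],
        [("b", "PREP"), ("clm", "NOUN")],
        [("bcl", "NOUN"), ("m", "PRON")],
    ],
}


def word_level_tags(word: str) -> List[str]:
    """Word-level view: the word is a single token, and each candidate
    tag is complex, joining the POS tags of all its segments."""
    return ["+".join(tag for _, tag in analysis) for analysis in TOY_LEXICON[word]]


def segment_level_inputs(word: str) -> List[List[str]]:
    """Segment-level view: the input enumerates the alternative
    segmentations; each segment then receives a simple POS tag."""
    return [[seg for seg, _ in analysis] for analysis in TOY_LEXICON[word]]


print(word_level_tags("bclm"))
print(segment_level_inputs("bclm"))
```

In the word-level view the tagger chooses among a few complex tags per word, whereas in the segment-level view it must choose among alternative token sequences of different lengths, which is why the abstract notes that the input itself has to encode the segmentation alternatives.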