Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard both to their segmentation and to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, representing both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After an extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that outperforms both initial architectures for Hebrew. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation.
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
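
The contrast between the two tokenization architectures described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `TOY_ANALYSES`, `word_level_view`, and `segment_level_view` are hypothetical, the one-word lexicon stands in for the output of a morphological analyzer, and the transliterated example "bcl" (readable either as a single noun or as a preposition followed by a noun) is a toy entry rather than data from the study. The sketch shows only how the candidate space is presented to the tagger, not the HMM itself.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]        # (surface segment, POS tag)
Analysis = Tuple[Segment, ...]   # one complete segmentation of a word

# Hypothetical analyzer output for a single transliterated word (toy data).
TOY_ANALYSES: Dict[str, List[Analysis]] = {
    "bcl": [
        (("bcl", "NOUN"),),                  # unsegmented reading
        (("b", "PREP"), ("cl", "NOUN")),     # prefix + noun reading
    ],
}


def word_level_view(word: str) -> Tuple[str, List[str]]:
    """Word-level tokenization: the token is the whole word, and each
    candidate tag is complex, encoding segmentation and per-segment POS."""
    tags = [
        "+".join(f"{seg}/{pos}" for seg, pos in analysis)
        for analysis in TOY_ANALYSES[word]
    ]
    return word, tags


def segment_level_view(word: str) -> List[Tuple[List[str], List[str]]]:
    """Segment-level tokenization: the input encodes the alternative
    segmentations, and each segment receives a standard POS tag.
    Simplification: one tag per segment per analysis."""
    return [
        ([seg for seg, _ in analysis], [pos for _, pos in analysis])
        for analysis in TOY_ANALYSES[word]
    ]


if __name__ == "__main__":
    print(word_level_view("bcl"))
    print(segment_level_view("bcl"))
```

In the word-level view the tagger chooses among a small number of complex tags for a single token, while in the segment-level view it must choose among token sequences of different lengths, which is why a standard HMM decoder needs some adaptation to handle it.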