A probabilistic morphological analyzer for Syriac

Authors:
Peter McClanahan;George Busby;Robbie Haertel;Kristian Heal;Deryle Lonsdale;Kevin Seppi;Eric Ringger
Affiliations:
Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Citing 15
Cited 0

Grammatical Agreement and Automatic Morphological Disambiguation of Inflectional Languages

TSD '01 Proceedings of the 4th International Conference on Text, Speech and Dialogue
Multitiered nonlinear morphology using multitape finite automata: a case study on Syriac and Arabic

Computational Linguistics - Special issue on finite-state methods in NLP
Tagging inflective languages: prediction of morphological categories for a rich, structured tagset

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Language model based arabic word segmentation

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Context-based morphological disambiguation with random fields

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Part-of-speech tagging of modern hebrew text

Natural Language Engineering
Identifying semitic roots: Machine learning with linguistic constraints

Computational Linguistics
Solving the problem of cascading errors: approximate Bayesian inference for linguistic annotation pipelines

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Automatic tagging of Arabic text: from raw text to base phrase chunks

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Arabic diacritization through full morphological tagging

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Smoothing a lexicon-based POS tagger for Arabic and Hebrew

Semitic '07 Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources
Automatic diacritization for low-resource languages using a hybrid word and consonant CMM

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Is Arabic part of speech tagging feasible without word segmentation?

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Quantified Score

Hi-index	0.00

Visualization

Abstract

We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.