Modeling morphologically rich languages using split words and unstructured dependencies

Authors:
Deniz Yuret;Ergun Biçici
Affiliations:
Koç University, Sariyer, Istanbul, Turkey;Koç University, Sariyer, Istanbul, Turkey
Venue:
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
Year:
2009

Abstract

We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, Flex-Grams, which assume that the n − 1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than only from the preceding n − 1 positions. Our final model achieves a 27% perplexity reduction compared to the standard n-gram model.
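
The idea of conditioning a token on a context chosen anywhere in the sentence can be illustrated with a minimal Python sketch for the n = 2 case. This is not the authors' implementation: the names `context_pairs`, `standard_heads`, and `PairModel`, the `heads` array encoding, the `+` prefix marking suffix tokens, and the add-one smoothing are all assumptions made for illustration. The abstract does not say how the conditioning positions are chosen or how probabilities are estimated.

```python
import math
from collections import Counter


def context_pairs(tokens, heads):
    """Yield (context, token) pairs.

    heads[i] is the index of the token that conditions tokens[i],
    or None to use a sentence-start symbol as the context.
    """
    for i, tok in enumerate(tokens):
        ctx = "<s>" if heads[i] is None else tokens[heads[i]]
        yield ctx, tok


def standard_heads(n_tokens):
    """Heads for an ordinary bigram model: each token depends on its predecessor."""
    return [None] + list(range(n_tokens - 1))


class PairModel:
    """Add-one smoothed estimate of P(token | context) (illustrative only)."""

    def __init__(self):
        self.pair_counts = Counter()
        self.ctx_counts = Counter()
        self.vocab = set()

    def train(self, sentences):
        # sentences is an iterable of (tokens, heads) pairs.
        for tokens, heads in sentences:
            for ctx, tok in context_pairs(tokens, heads):
                self.pair_counts[(ctx, tok)] += 1
                self.ctx_counts[ctx] += 1
                self.vocab.add(tok)

    def prob(self, ctx, tok):
        v = len(self.vocab) + 1  # one extra slot reserved for unseen tokens
        return (self.pair_counts[(ctx, tok)] + 1) / (self.ctx_counts[ctx] + v)

    def perplexity(self, sentences):
        log_sum, n = 0.0, 0
        for tokens, heads in sentences:
            for ctx, tok in context_pairs(tokens, heads):
                log_sum += math.log2(self.prob(ctx, tok))
                n += 1
        return 2 ** (-log_sum / n)


# Toy split of the Turkish phrase "evde kaldı" into stems and suffixes.
tokens = ["ev", "+de", "kal", "+dı"]
# Flexible choice: the stem "kal" is conditioned on the stem "ev"
# rather than on the immediately preceding suffix "+de".
flexible = [None, 0, 0, 2]

model = PairModel()
model.train([(tokens, flexible)])
print(model.perplexity([(tokens, flexible)]))
```

Passing `standard_heads(len(tokens))` in place of `flexible` recovers an ordinary bigram model over the same split tokens, so the two conditioning schemes can be compared by perplexity on held-out text.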