Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

  • Authors:
  • Airenas Vaičiūnas;Vytautas Kaminskas;Gailius Raškinis

  • Affiliations:
  • Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-3035 Kaunas, Lithuania, e-mail: airenas@freemail.lt, V.Kaminskas@if.vdu.lt;Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-3035 Kaunas, Lithuania, e-mail: airenas@freemail.lt, V.Kaminskas@if.vdu.lt;Center of Computational Linguistics, Vytautas Magnus University, Donelaičio 52, LT-3000 Kaunas, Lithuania, e-mail: idgara@vdu.lt

  • Venue:
  • Informatica
  • Year:
  • 2004

Abstract

This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model led to a perplexity reduction of up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
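
The linear interpolation and perplexity comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the interpolation weight `lam`, and the per-token probabilities are all hypothetical, and the word-model and class-model probabilities are simply taken as given rather than estimated from a corpus.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two model probabilities for the same token.

    lam is the (assumed) weight of the word n-gram model; the class-based
    model receives the remaining weight 1 - lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a token sequence from its per-token probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical probabilities that each model assigns to four test tokens.
word_probs = [0.10, 0.02, 0.30, 0.01]   # stand-in for the baseline 3-gram model
class_probs = [0.08, 0.06, 0.20, 0.05]  # stand-in for a class-based model
lam = 0.6  # assumed weight; in practice tuned on held-out data

mixed_probs = [interpolate(pw, pc, lam)
               for pw, pc in zip(word_probs, class_probs)]

baseline_ppl = perplexity(word_probs)
mixed_ppl = perplexity(mixed_probs)

# Relative perplexity reduction with respect to the baseline model.
reduction = 1.0 - mixed_ppl / baseline_ppl
print(f"baseline: {baseline_ppl:.2f}, interpolated: {mixed_ppl:.2f}, "
      f"reduction: {reduction:.1%}")
```

In this toy setting the class-based model helps mostly on the tokens where the word model assigns very low probability, which is the sparse-data effect that interpolation is meant to soften. The numbers printed here have no relation to the figures reported in the paper.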