Cache-based Statistical Language Models of English and Highly Inflected Lithuanian

  • Authors:
  • Airenas Vaičiūnas;Gailius Raškinis

  • Affiliations:
  • Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-44404 Kaunas, Lithuania, e-mail: airenas@freemail.lt, g.raskinis@if.vdu.lt

  • Venue:
  • Informatica
  • Year:
  • 2006

Abstract

This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. It studies how the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) affect the perplexity of a language model. The best results are achieved by models consisting of three components: a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamic weight updates. Such a model yielded perplexity improvements of up to 36% and 43% over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English yielded an improvement of up to 16%. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
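
To make the ingredients named in the abstract concrete, the following Python sketch combines a baseline model with a decaying cache 1-gram by linear interpolation and compares static with dynamically updated weights. It is an illustration under stated assumptions, not the authors' implementation: the exponential decay, the EM-style weight update, the parameter values, and all function names are hypothetical choices, and a uniform distribution stands in for the 3-gram baseline. The cache 2-gram component is omitted for brevity.

```python
import math
from collections import deque


def decaying_cache_unigram(history, word, decay=0.01):
    """Cache 1-gram: each past token votes with weight exp(-decay * distance).

    The exponential form is an assumption; the paper also studies other
    decay functions, including corpus-derived ones.
    """
    num = 0.0
    den = 0.0
    for distance, past in enumerate(reversed(history), start=1):
        w = math.exp(-decay * distance)
        den += w
        if past == word:
            num += w
    return num / den if den > 0 else 0.0


def interpolate(probs, weights):
    """Linear interpolation of component probabilities."""
    return sum(l * p for l, p in zip(weights, probs))


def update_weights(weights, probs, rate=0.05):
    """Hypothetical dynamic update: move each weight toward the component's
    posterior responsibility for the word just observed (online EM-style)."""
    mix = interpolate(probs, weights)
    if mix <= 0:
        return list(weights)
    post = [l * p / mix for l, p in zip(weights, probs)]
    return [(1 - rate) * l + rate * r for l, r in zip(weights, post)]


def perplexity(tokens, base_prob, cache_size=500, decay=0.01,
               weights=(0.9, 0.1), rate=0.05, dynamic=True):
    """Perplexity of the interpolated model over a token sequence.

    base_prob(context, word) must return a strictly positive probability.
    """
    history = deque(maxlen=cache_size)  # the cache of recent tokens
    weights = list(weights)
    log_sum = 0.0
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - 2):i]  # two-word context, as for a 3-gram
        probs = [base_prob(context, word),
                 decaying_cache_unigram(history, word, decay)]
        log_sum += math.log(interpolate(probs, weights))
        if dynamic:
            weights = update_weights(weights, probs, rate)
        history.append(word)
    return math.exp(-log_sum / len(tokens))


# Toy usage: a uniform model over a 10-word vocabulary stands in for the 3-gram.
def uniform(context, word):
    return 1.0 / 10


tokens = ["a", "b", "a", "b"] * 5
ppl_static = perplexity(tokens, uniform, dynamic=False)
ppl_dynamic = perplexity(tokens, uniform, dynamic=True)
```

On this highly repetitive toy sequence, both interpolated models score below the uniform baseline's perplexity of 10, and the dynamic variant scores lower than the static one because the cache weight grows as the cache keeps predicting well. The toy data says nothing about the magnitudes reported in the paper.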