Cache-based Statistical Language Models of English and Highly Inflected Lithuanian

Authors:
Airenas Vaičiūnas; Gailius Raškinis
Affiliations:
Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-44404 Kaunas, Lithuania, e-mail: airenas@freemail.lt, g.raskinis@if.vdu.lt
Venue:
Informatica
Year:
2006

Abstract

This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. The impact of the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) on the perplexity of a language model is studied. The best results are achieved by models consisting of three components: a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined together by linear interpolation with dynamic weight updates. Such a model led to a perplexity improvement of up to 36% and 43% with respect to the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English led to a perplexity improvement of up to 16%. This suggests that cache-based modeling is of greater utility for free word order, highly inflected languages.
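
To make the ingredients named in the abstract concrete, the following is a minimal sketch of a decaying cache 1-gram linearly interpolated with a baseline model, evaluated by perplexity. It is not the authors' implementation. Several things are assumptions made for illustration: the exponential form of the decay and its rate, the uniform baseline standing in for the 3-gram, the omission of the cache 2-gram component, and the dynamic update rule (a single online EM-style step per word). All function and parameter names (cache_prob, evaluate, lam, rate, decay) are hypothetical.

```python
import math

def cache_prob(history, word, decay=0.05):
    """Decaying cache 1-gram: a word seen d positions ago counts exp(-decay * d).

    The exponential decay is an assumption; the paper also studies other functions.
    """
    n = len(history)
    weights = [math.exp(-decay * (n - i)) for i in range(n)]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # empty cache: no evidence yet
    hit = sum(wt for wt, w in zip(weights, history) if w == word)
    return hit / total

def evaluate(words, base_prob, lam=0.2, rate=0.0, decay=0.05):
    """Perplexity of the interpolated model (1 - lam) * P_base + lam * P_cache.

    rate == 0.0 keeps lam fixed (static interpolation). rate > 0.0 moves lam
    toward the posterior probability that the cache component produced the
    current word (an assumed, simplified form of dynamic weight update).
    base_prob must return a value > 0 for every word.
    """
    log_sum = 0.0
    history = []
    for w in words:
        p_b = base_prob(w, history)
        p_c = cache_prob(history, w, decay)
        p = (1.0 - lam) * p_b + lam * p_c
        log_sum += math.log(p)
        if rate > 0.0:
            posterior = lam * p_c / p
            lam = (1.0 - rate) * lam + rate * posterior
            # Keep some weight on the baseline so unseen words stay possible.
            lam = min(max(lam, 1e-3), 1.0 - 1e-3)
        history.append(w)
    return math.exp(-log_sum / len(words))

def improvement(ppl_base, ppl_model):
    """Relative perplexity improvement over the baseline, in percent."""
    return 100.0 * (ppl_base - ppl_model) / ppl_base

# Toy data: a highly repetitive text and a uniform baseline over an
# assumed vocabulary of 1000 words (a stand-in for a trained 3-gram).
text = ["kaunas", "yra", "miestas"] * 20
uniform = lambda word, history: 1.0 / 1000

ppl_base = evaluate(text, uniform, lam=0.0)
ppl_static = evaluate(text, uniform, lam=0.2)
ppl_dynamic = evaluate(text, uniform, lam=0.2, rate=0.1)
print(ppl_base, ppl_static, ppl_dynamic)
print(improvement(ppl_base, ppl_dynamic))
```

On this artificial text the cache helps a great deal because every word recurs; the figures it prints say nothing about the corpora used in the paper. On text with little repetition, a fixed cache weight can raise perplexity instead, which is one motivation for updating the weight dynamically.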