An empirical investigation of discounting in cross-domain language models

  • Authors: Greg Durrett; Dan Klein

  • Affiliations: University of California, Berkeley; University of California, Berkeley

  • Venue: HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
  • Year: 2011

Abstract

We investigate the empirical behavior of n-gram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.
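As a rough illustration of the kind of measurement the abstract describes, the sketch below groups n-grams by their training count k and reports k minus the scaled average held-out count, one simple way to read off an empirical discount curve. This is a hedged sketch in Python with illustrative names, not the authors' code or exact methodology.

```python
from collections import Counter, defaultdict

def empirical_discounts(train_tokens, test_tokens, n=3):
    """Average empirical discount per training count k: k minus the
    scaled held-out count of n-grams seen k times in training.
    (Illustrative sketch; names and scaling choices are assumptions.)"""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    train_counts = ngram_counts(train_tokens)
    test_counts = ngram_counts(test_tokens)

    # Scale held-out counts so they are comparable to training counts.
    scale = sum(train_counts.values()) / max(1, sum(test_counts.values()))

    # Bucket n-grams by training count k and average their scaled held-out counts.
    by_count = defaultdict(list)
    for gram, k in train_counts.items():
        by_count[k].append(test_counts.get(gram, 0) * scale)

    return {k: k - sum(held) / len(held) for k, held in sorted(by_count.items())}
```

Under the paper's findings, a curve computed this way stays roughly flat across k when training and test data come from the same domain, but grows roughly linearly in k when the domains diverge, which is what motivates replacing modified Kneser-Ney's constant discounts with growing ones.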