N-gram weighting: reducing training data mismatch in cross-domain language model estimation

  • Authors:
  • Bo-June (Paul) Hsu; James Glass

  • Affiliations:
  • MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA (both authors)

  • Venue:
  • EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2008

Abstract

In domains with insufficient matched training data, language models are often constructed by interpolating component models trained from partially matched corpora. Since the n-grams from such corpora may not be of equal relevance to the target domain, we propose an n-gram weighting technique to adjust the component n-gram probabilities based on features derived from readily available segmentation and metadata information for each corpus. Using a log-linear combination of such features, the resulting model achieves up to a 1.2% absolute word error rate reduction over a linearly interpolated baseline language model on a lecture transcription task.
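To make the mechanism concrete, the sketch below shows one plausible realization of log-linear n-gram weighting within linear interpolation; it is not the authors' implementation. Each component n-gram probability is scaled by exp(theta · f), where f is a feature vector for that n-gram, and the mixture is renormalized per history so it remains a proper distribution. The feature name seg_frac (fraction of corpus segments containing the n-gram), the weights theta, and the toy probabilities are all hypothetical stand-ins for the segmentation- and metadata-derived features described in the abstract.

```python
import math
from collections import defaultdict

def ngram_weight(features, theta):
    # Log-linear weight for one n-gram: exp(theta . f), mirroring the
    # "log-linear combination of features" described in the abstract.
    return math.exp(sum(theta[k] * v for k, v in features.items()))

def weighted_interpolation(component_probs, component_feats, lambdas, theta):
    # Scale each component's n-gram probability P_i(w | h) by its
    # log-linear weight, mix with interpolation weights lambdas, then
    # renormalize per history h so each P(. | h) sums to 1.
    mixed = defaultdict(float)
    for probs, feats, lam in zip(component_probs, component_feats, lambdas):
        for (h, w), p in probs.items():
            mixed[(h, w)] += lam * ngram_weight(feats[(h, w)], theta) * p
    totals = defaultdict(float)
    for (h, w), p in mixed.items():
        totals[h] += p
    return {(h, w): p / totals[h] for (h, w), p in mixed.items()}

# Toy example: two partially matched corpora and one hypothetical
# feature, seg_frac (fraction of corpus segments containing the bigram).
probs_a = {("the", "lecture"): 0.6, ("the", "paper"): 0.4}
probs_b = {("the", "lecture"): 0.2, ("the", "paper"): 0.8}
feats_a = {("the", "lecture"): {"seg_frac": 0.9},
           ("the", "paper"): {"seg_frac": 0.3}}
feats_b = {("the", "lecture"): {"seg_frac": 0.4},
           ("the", "paper"): {"seg_frac": 0.7}}
theta = {"seg_frac": 1.5}   # feature weight; in practice tuned on held-out data
lambdas = [0.5, 0.5]        # linear interpolation weights

print(weighted_interpolation([probs_a, probs_b],
                             [feats_a, feats_b],
                             lambdas, theta))
```

The per-history renormalization matters here: once n-grams are reweighted unequally, the raw mixture no longer sums to one, so the division restores a valid conditional distribution before the model is used for scoring.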