N-gram weighting: reducing training data mismatch in cross-domain language model estimation

  • Authors:
  • Bo-June (Paul) Hsu; James Glass

  • Affiliations:
  • MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA (both authors)

  • Venue:
  • EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2008

Abstract

In domains with insufficient matched training data, language models are often constructed by interpolating component models trained from partially matched corpora. Since the n-grams from such corpora may not be of equal relevance to the target domain, we propose an n-gram weighting technique to adjust the component n-gram probabilities based on features derived from readily available segmentation and metadata information for each corpus. Using a log-linear combination of such features, the resulting model achieves up to a 1.2% absolute word error rate reduction over a linearly interpolated baseline language model on a lecture transcription task.
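To make the mechanism concrete, the sketch below shows one plausible realization of log-linear n-gram weighting within linear interpolation; it is not the authors' implementation. Each component n-gram probability is scaled by exp(theta · f), where f is a feature vector for that n-gram, and the mixture is renormalized per history so it remains a proper distribution. The feature name seg_frac (fraction of corpus segments containing the n-gram), the weights theta, and the toy probabilities are all hypothetical stand-ins for the segmentation- and metadata-derived features described in the abstract.

```python
import math
from collections import defaultdict

def ngram_weight(features, theta):
    # Log-linear weight for one n-gram: exp(theta . f), mirroring the
    # "log-linear combination of features" described in the abstract.
    return math.exp(sum(theta[k] * v for k, v in features.items()))

def weighted_interpolation(component_probs, component_feats, lambdas, theta):
    # Scale each component's n-gram probability P_i(w | h) by its
    # log-linear weight, mix with interpolation weights lambdas, then
    # renormalize per history h so each P(. | h) sums to 1.
    mixed = defaultdict(float)
    for probs, feats, lam in zip(component_probs, component_feats, lambdas):
        for (h, w), p in probs.items():
            mixed[(h, w)] += lam * ngram_weight(feats[(h, w)], theta) * p
    totals = defaultdict(float)
    for (h, w), p in mixed.items():
        totals[h] += p
    return {(h, w): p / totals[h] for (h, w), p in mixed.items()}

# Toy example: two partially matched corpora and one hypothetical
# feature, seg_frac (fraction of corpus segments containing the bigram).
probs_a = {("the", "lecture"): 0.6, ("the", "paper"): 0.4}
probs_b = {("the", "lecture"): 0.2, ("the", "paper"): 0.8}
feats_a = {("the", "lecture"): {"seg_frac": 0.9},
           ("the", "paper"): {"seg_frac": 0.3}}
feats_b = {("the", "lecture"): {"seg_frac": 0.4},
           ("the", "paper"): {"seg_frac": 0.7}}
theta = {"seg_frac": 1.5}   # feature weight; in practice tuned on held-out data
lambdas = [0.5, 0.5]        # linear interpolation weights

print(weighted_interpolation([probs_a, probs_b],
                             [feats_a, feats_b],
                             lambdas, theta))
```

The per-history renormalization matters here: once n-grams are reweighted unequally, the raw mixture no longer sums to one, so the division restores a valid conditional distribution before the model is used for scoring.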