Intelligent selection of language model training data

  • Authors:
  • Robert C. Moore;William Lewis

  • Affiliations:
  • Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA

  • Venue:
  • ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specifc language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data selection and two other previously proposed methods.