Intelligent selection of language model training data

Authors:
Robert C. Moore;William Lewis
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Year:
2010

Citing 2
Cited 18

Toward a unified approach to statistical language modeling for Chinese

ACM Transactions on Asian Language Information Processing (TALIP)
Learning classifiers from only positive and unlabeled data

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Effective measures of domain similarity for parsing

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
An empirical investigation of discounting in cross-domain language models

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
The RWTH Aachen machine translation system for WMT 2011

WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Domain adaptation via pseudo in-domain data selection

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
The imagination of crowds: conversational AAC language modeling using crowdsourcing and large data sources

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Does more data always yield better translations?

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Adapting translation models to translationese improves SMT

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Perplexity minimization for translation model domain adaptation in statistical machine translation

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Large, pruned or continuous space language models on a GPU for statistical machine translation

WLM '12 Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT
Translation model based cross-lingual language model adaptation: from word models to phrase models

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Applying prediction techniques to phoneme-based AAC systems

SLPAT '12 Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies
The RWTH Aachen machine translation system for WMT 2012

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Joint WMT 2012 submission of the QUAERO project

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
LIUM's SMT machine translation systems for WMT 2012

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Selecting data for English-to-Czech machine translation

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
DFKI's SMT system for WMT 2012

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Improving statistical machine translation by adapting translation models to translationese

Computational Linguistics

Abstract

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data selection and two other previously proposed methods.
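
The selection idea in the abstract can be illustrated with a minimal sketch. It scores each sentence of the general (non-domain-specific) text by the difference between its per-token cross-entropy under a domain-specific model and under a model trained on the general text itself, then keeps the sentences that score below a threshold. Several details here are assumptions made for illustration and are not taken from the abstract: the add-one-smoothed unigram models (a real system would more likely use higher-order n-gram models), whitespace tokenization, the threshold value of 0.0, and the function names `train_unigram`, `cross_entropy`, and `select`.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; returns (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select(in_domain, general, threshold=0.0):
    """Keep general sentences whose cross-entropy difference is below threshold.

    Score = H_in_domain(s) - H_general(s); lower means the sentence looks
    more like the domain-specific text than like the general text.
    The threshold is a hypothetical tuning parameter.
    """
    vocab = {tok for s in in_domain + general for tok in s.split()}
    vocab_size = len(vocab) + 1  # one extra slot reserves mass for unseen tokens
    model_in = train_unigram(in_domain)
    model_gen = train_unigram(general)
    scored = [
        (cross_entropy(s, model_in, vocab_size)
         - cross_entropy(s, model_gen, vocab_size), s)
        for s in general
    ]
    # Sort so the most domain-like sentences come first.
    return [s for score, s in sorted(scored) if score < threshold]
```

In practice the threshold would be tuned, for example by measuring the perplexity of a model trained on the selected subset against held-out domain-specific text, rather than fixed in advance.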