Language model adaptation with additional text generated by machine translation

Authors:
Hideharu Nakajima; Hirofumi Yamamoto; Taro Watanabe
Affiliations:
NTT Corporation, Kanagawa, Japan; ATR Spoken Language Translation Research Laboratories, Kyoto, Japan; ATR Spoken Language Translation Research Laboratories, Kyoto, Japan
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002



Abstract

Statistical language modeling requires a large corpus from the application domain. When no such corpus is available, language model adaptation is often used in speech recognition research. Adaptation needs only a small corpus from the application domain (the "target corpus"), but that corpus must be written in the language of the model. Even a small corpus can be difficult to collect, especially for spoken language, because of the high cost. To address this problem, this paper proposes a novel scheme that generates a small target corpus in the language of the model by machine-translating a target corpus written in another language. Because the information about adjacent words that a statistical language model needs is stored in the translation knowledge, it can be extracted by machine translation and used for adaptation. In experiments, the language model improvement was about half of that obtained with a human-collected corpus, which provides an initial proof of concept.
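
The general idea of adapting an n-gram model with a small target corpus can be illustrated with a minimal sketch. The abstract does not say which adaptation method or model order the paper uses, so everything below is an assumption made for illustration: the model is an unsmoothed bigram model, adaptation is done by count merging (target counts are scaled by a hypothetical `weight` and added to the general counts), and the machine-translated target corpus is stood in for by two hard-coded sentences rather than real translation output.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count bigrams and their history (left-context) words."""
    bigrams, histories = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            histories[prev] += 1
    return bigrams, histories


def adapted_bigram_prob(prev, word, general, target, weight=5.0):
    """Count-merging estimate (an assumed method, not taken from the paper).

    Target-corpus counts are multiplied by `weight` and added to the
    general-corpus counts before normalizing. No smoothing is applied.
    """
    g_bigrams, g_histories = general
    t_bigrams, t_histories = target
    numerator = g_bigrams[(prev, word)] + weight * t_bigrams[(prev, word)]
    denominator = g_histories[prev] + weight * t_histories[prev]
    return numerator / denominator if denominator else 0.0


# Toy general-domain corpus in the language of the model.
general_corpus = ["the meeting is at noon", "the report is due today"]

# Stand-in for machine-translated target-domain text (hypothetical data).
translated_target_corpus = ["the room is available",
                            "is the room available tonight"]

general = bigram_counts(general_corpus)
target = bigram_counts(translated_target_corpus)

# weight=0.0 ignores the target corpus, giving the unadapted estimate.
p_before = adapted_bigram_prob("the", "room", general, target, weight=0.0)
p_after = adapted_bigram_prob("the", "room", general, target, weight=5.0)
print(p_before, p_after)
```

In this toy setup the bigram "the room" never occurs in the general corpus, so its unadapted probability is zero; after merging the translated text it receives most of the probability mass following "the". A practical system would add smoothing and tune the weight on held-out data.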