Language model adaptation for automatic speech recognition and statistical machine translation

  • Authors:
  • Sanjeev Khudanpur; Woosung Kim

  • Affiliations:
  • The Johns Hopkins University; The Johns Hopkins University

  • Venue:
  • Doctoral Dissertation, The Johns Hopkins University
  • Year:
  • 2005

Abstract

Language modeling is critical and indispensable for many natural language applications such as automatic speech recognition and machine translation. Due to the complexity of natural language grammars, it is almost impossible to construct language models from a set of linguistic rules; statistical techniques have therefore dominated language modeling over the last few decades. All statistical modeling techniques rest, in principle, on two assumptions: (1) a reasonable amount of training data is available, and (2) the training data comes from the same population as the test data to which we want to apply our model. Because statistical models are built from observations of the training data, their success depends crucially on that data. In other words, if we do not have enough training data, or the training data is mismatched with the test data, we cannot build accurate statistical models. This thesis presents novel methods, collectively referred to as language model adaptation, to cope with these problems. We first tackle the data-deficiency problem for languages in which extensive text collections are not available. We propose methods that take advantage of a resource-rich language such as English, using cross-lingual information retrieval followed by machine translation, to adapt language models for the resource-deficient language. By exploiting a copious side corpus of contemporaneous English articles to adapt the language model of a resource-deficient language, we achieve significant improvements in speech recognition accuracy. We next experiment with language model adaptation in English, which is resource-rich, in a different application: statistical machine translation. Regardless of its size, training data that is mismatched with the test data of interest does not necessarily yield accurate statistical models. Instead, we select small but effective sets of texts by information retrieval and use them for adaptation. Experimental results show that our adaptation techniques are effective for statistical machine translation as well.
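
The adaptation strategy described in the abstract amounts, at its core, to combining a large background language model with a small in-domain model estimated from retrieved (and, in the cross-lingual case, machine-translated) texts. The following is a minimal Python sketch of that interpolation idea under simplifying assumptions; it is not the thesis's actual implementation, and the corpora, the unigram-only models, and the interpolation weight are hypothetical placeholders.

from collections import Counter


def unigram_probs(tokens):
    # Maximum-likelihood unigram probabilities estimated from a token list.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(background, adapted, lam, vocab):
    # Linear interpolation: P(w) = lam * P_adapted(w) + (1 - lam) * P_background(w).
    return {
        w: lam * adapted.get(w, 0.0) + (1.0 - lam) * background.get(w, 0.0)
        for w in vocab
    }


# Hypothetical background corpus, standing in for the large but mismatched training set.
BACKGROUND_CORPUS = "the cat sat on the mat the dog ran".split()

# Hypothetical retrieved side corpus, standing in for the small, topic-matched texts
# obtained via information retrieval (and machine translation in the cross-lingual case).
retrieved_docs = "the election results were announced the votes were counted".split()

bg = unigram_probs(BACKGROUND_CORPUS)
adapt = unigram_probs(retrieved_docs)
vocab = set(BACKGROUND_CORPUS) | set(retrieved_docs)

# A fixed interpolation weight for illustration; in practice it would be tuned on held-out data.
adapted_lm = interpolate(bg, adapt, lam=0.3, vocab=vocab)
print(sorted(adapted_lm.items(), key=lambda kv: -kv[1])[:5])

In practice the background and adapted models would be smoothed n-gram models rather than unigram counts, but the interpolation step that shifts probability mass toward the topic of the test data is the same.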