Language and translation model adaptation using comparable corpora
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
On-line language model biasing for statistical machine translation
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Translation model based cross-lingual language model adaptation: from word models to phrase models
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Language modeling is critical and indispensable for many natural language applications such as automatic speech recognition and machine translation. Due to the complexity of natural language grammars, it is almost impossible to construct language models from a set of linguistic rules; statistical techniques have therefore dominated language modeling over the last few decades. All statistical modeling techniques, in principle, work under two conditions: (1) a reasonable amount of training data is available, and (2) the training data comes from the same population as the test data to which we want to apply our model. Because statistical models are built from observations of the training data, their success depends crucially on that data: if we do not have enough data for training, or the training data does not match the test data, we cannot build accurate statistical models. This thesis presents novel methods to cope with these problems in language modeling: language model adaptation.

We first tackle the data-deficiency problem for languages in which extensive text collections are not available. We propose methods that take advantage of a resource-rich language such as English, using cross-lingual information retrieval followed by machine translation to adapt language models for the resource-deficient language. By exploiting a copious side corpus of contemporaneous English articles to adapt the language model of a resource-deficient language, we achieve significant improvements in speech recognition accuracy.

We next experiment with language model adaptation in English, a resource-rich language, in a different application: statistical machine translation. Regardless of its size, training data that does not match the test data of interest does not necessarily yield accurate statistical models. Rather than using all available data, we select small but effective texts by information retrieval and use them for adaptation. Experimental results show that our adaptation techniques are effective for statistical machine translation as well.
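The retrieval-based adaptation idea above can be sketched as a toy pipeline: rank a pool of training documents by TF-IDF cosine similarity to the test input, build a model from the top-ranked subset, and linearly interpolate it with the background model. This is a minimal sketch under simplifying assumptions — unigram models with add-one smoothing, a hand-made mini-corpus, and hypothetical function names — not the actual systems evaluated in the thesis, which involve full n-gram models and much larger corpora.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for a large general-domain pool;
# the documents and query are illustrative, not data from the thesis.
POOL = [
    ["stocks", "fell", "today"],
    ["stocks", "rose", "sharply"],
    ["rain", "expected", "tomorrow"],
]
QUERY = ["stocks", "markets"]  # stands in for the test input


def tfidf_vectors(docs):
    """TF-IDF vector (as a dict) for each tokenized document."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [
        {w: tf * math.log((1 + n) / (1 + df[w])) for w, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_adaptation_texts(pool, query, k=2):
    """Retrieve the k pool documents most similar to the query document."""
    vecs = tfidf_vectors(pool + [query])
    qvec = vecs[-1]
    ranked = sorted(range(len(pool)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]


def unigram_lm(docs, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


VOCAB = {w for d in POOL for w in d} | set(QUERY)
background = unigram_lm(POOL, VOCAB)  # trained on the whole pool
adapted = unigram_lm(select_adaptation_texts(POOL, QUERY), VOCAB)  # retrieved subset
# Linear interpolation keeps background coverage while biasing toward the query domain.
mixed = {w: 0.5 * adapted[w] + 0.5 * background[w] for w in VOCAB}
```

After interpolation, words that characterize the test input (here, "stocks") receive higher probability than under the background model alone, while the mixture remains a proper distribution — the same intuition that drives the adaptation gains reported in the abstract.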