Joint and coupled bilingual topic model based sentence representations for language model adaptation

  • Authors:
  • Shixiang Lu;Xiaoyin Fu;Wei Wei;Xingyuan Peng;Bo Xu

  • Affiliations:
  • Interactive Digital Media Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China (all authors)

  • Venue:
  • IJCAI'13 Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
  • Year:
  • 2013

Abstract

This paper is concerned with data selection for adapting the language model (LM) in statistical machine translation (SMT), and aims to find LM training sentences that are topically similar to the translation task. Although traditional approaches have achieved good performance, they ignore topic information and the distribution of words when selecting similar training sentences. In this paper, we present two bilingual topic model (BLTM) based sentence representations, joint and coupled BLTM, for cross-lingual data selection. We map the data selection task into cross-lingual semantic representations that are language independent, then rank and select sentences in the target-language LM training corpus for each sentence in the translation task by a semantics-based likelihood. The semantic representations are learned from the parallel corpus, under the assumption that the two sides of a bilingual pair share the same or similar distribution over semantic topics. Large-scale experimental results demonstrate that our approaches significantly outperform state-of-the-art approaches on both LM perplexity and translation performance.
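
To make the ranking step concrete, the minimal Python sketch below illustrates one plausible form of the semantics-based likelihood, not the paper's actual implementation: it assumes a trained BLTM supplies a topic mixture `theta_task` inferred for a translation-task sentence and a target-language topic-word matrix `phi`, scores each candidate training sentence by the likelihood of its words under that mixture, and keeps the top-scoring sentences. The length normalization is an added assumption to avoid favoring short sentences.

```python
import numpy as np

def sentence_log_likelihood(word_ids, theta_task, phi):
    """log P(sentence | theta_task) = sum over words w of
    log sum_k theta_task[k] * phi[k, w].

    word_ids   : target-vocabulary indices of one candidate sentence
    theta_task : (K,) topic mixture inferred for a translation-task sentence
    phi        : (K, V) target-language topic-word distributions from the BLTM
    """
    word_probs = theta_task @ phi[:, word_ids]   # (len(word_ids),)
    return float(np.sum(np.log(word_probs + 1e-12)))

def select_topic_similar(corpus, theta_task, phi, top_n=100000):
    """Rank the target-language LM training corpus by the semantics-based
    likelihood and keep the top_n sentences; scores are length-normalized
    (an illustrative choice, not necessarily the paper's)."""
    scores = np.array([
        sentence_log_likelihood(s, theta_task, phi) / max(len(s), 1)
        for s in corpus
    ])
    return np.argsort(-scores)[:top_n]
```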