An empirical study on word segmentation for Chinese machine translation

  • Authors:
  • Hai Zhao; Masao Utiyama; Eiichiro Sumita; Bao-Liang Lu

  • Affiliations:
  • MOE-Microsoft Key Laboratory of Intelligent Computing and Intelligent System, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Engineering, Shanghai Jiao Tong Univ ...; Multilingual Translation Laboratory, MASTAR Project, National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan; Multilingual Translation Laboratory, MASTAR Project, National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan; MOE-Microsoft Key Laboratory of Intelligent Computing and Intelligent System, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Engineering, Shanghai Jiao Tong Univ ...

  • Venue:
  • CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
  • Year:
  • 2013

Abstract

Word segmentation has been shown to be helpful for Chinese-to-English machine translation (MT), yet the way different segmentation strategies affect MT is poorly understood. In this paper, we compare different segmentation strategies in terms of machine translation quality. Our empirical study covers both English-to-Chinese and Chinese-to-English translation for the first time. Our results show that the necessity of word segmentation depends on the translation direction. After comparing two types of segmentation strategies with their associated linguistic resources, we demonstrate that optimizing segmentation itself does not guarantee better MT performance, and that the choice of segmentation strategy is not the key to improving MT. Instead, we find that linguistic resources, such as the segmented corpora or the dictionaries that segmentation tools rely on, actually determine how word segmentation affects machine translation. Based on these findings, we propose an empirical approach that directly optimizes the dictionary used by the word segmenter with respect to the MT task, providing a BLEU score improvement of 1.30.
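
The point that the dictionary, rather than the segmentation strategy, drives the output can be illustrated with a minimal sketch. It uses forward maximum matching, a simple dictionary-based segmenter chosen here purely for illustration; the abstract does not say which segmenters the paper evaluates. The function name `forward_max_match`, the toy dictionaries, and the example sentence are all assumptions, not material from the paper.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative, not the paper's method).

    At each position, take the longest substring found in `dictionary`,
    falling back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as the fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionaries that differ by a single entry.
sentence = "研究生命起源"
dict_a = {"研究", "研究生", "生命", "起源"}
dict_b = {"研究", "生命", "起源"}

print(forward_max_match(sentence, dict_a))
print(forward_max_match(sentence, dict_b))
```

With the same algorithm, removing one dictionary entry changes the segmentation of the whole sentence, and therefore the tokens an MT system would be trained on. Tuning the dictionary against an MT metric such as BLEU, as the abstract proposes, would sit on top of a segmenter like this one.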