Language independent word segmentation for statistical machine translation

  • Authors:
  • Michael Paul;Andrew Finch;Eiichiro Sumita

  • Affiliations:
  • National Institute of Information and Communications Technology (NICT), Kyoto, Japan;National Institute of Information and Communications Technology (NICT), Kyoto, Japan;National Institute of Information and Communications Technology (NICT), Kyoto, Japan

  • Venue:
  • Proceedings of the 3rd International Universal Communication Symposium
  • Year:
  • 2009

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text so as to optimize the translation quality of statistical machine translation (SMT) systems. The method is language-independent: it uses a parallel corpus to align source-language characters to the corresponding whitespace-separated word units in the target language. Successive characters aligned to the same target word are merged into a larger source-language unit, and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. In experiments translating five Asian languages into English, the proposed method outperformed a baseline system that translates unigram-segmented source-language sentences.
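
The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_aligned_characters`, the representation of the alignment as one target-word index per source character, and the choice to leave unaligned characters (marked `None`) as single-character units are all assumptions made for illustration. The character-to-word alignment itself is taken as given, and the Maximum Entropy learning step is not covered.

```python
from itertools import groupby
from typing import List, Optional, Sequence


def merge_aligned_characters(
    chars: Sequence[str],
    alignment: Sequence[Optional[int]],
) -> List[str]:
    """Merge successive source characters aligned to the same target word.

    chars     -- source-language characters, in sentence order
    alignment -- for each character, the index of the target word it is
                 aligned to, or None if it is unaligned (an assumption of
                 this sketch, not something the abstract specifies)
    """
    if len(chars) != len(alignment):
        raise ValueError("chars and alignment must have the same length")

    units: List[str] = []
    position = 0
    # groupby collects runs of consecutive, identical alignment targets.
    for target_index, run in groupby(alignment):
        run_length = len(list(run))
        segment = chars[position:position + run_length]
        if target_index is None:
            # Assumption: unaligned characters stay as separate units.
            units.extend(segment)
        else:
            units.append("".join(segment))
        position += run_length
    return units


# Hypothetical example: Japanese source, English target "go to Tokyo"
# (target word indices: go=0, to=1, Tokyo=2).
source = ["東", "京", "に", "行", "く"]
links = [2, 2, 1, 0, 0]
print(merge_aligned_characters(source, links))
```

Only adjacent characters are merged, so two characters aligned to the same target word but separated by a differently aligned character remain distinct units.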