Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity

  • Authors:
  • Takao Doi;Eiichiro Sumita

  • Affiliations:
  • The authors are with ATR Spoken Language Translation Research Laboratories, Kyoto-fu, 619-0288 Japan. E-mail: takao.doi@atr.jp

  • Venue:
  • IEICE - Transactions on Information and Systems
  • Year:
  • 2005

Abstract

Splitting an input utterance is a promising way to improve the translation quality of corpus-based MT systems for speech translation. Many previous methods relied on word-sequence characteristics, such as N-gram clues for splitting positions. To supplement such methods, this paper introduces an additional clue: utterance similarity based on edit distance. The proposed method generates candidate splittings of an utterance using N-grams and selects the best candidate by measuring its similarity to utterances in a corpus. This selection rests on the assumption that a corpus-based MT system can correctly translate an utterance that is similar to one in its training corpus. We conducted experiments with three MT systems: two EBMT systems, one using the phrase and the other the utterance as its translation unit, and one SMT system. Translation results under various conditions were evaluated with objective measures and a subjective measure. The results show that the proposed method is valuable for all three systems and that using utterance similarity can improve translation quality.
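
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact similarity formula or the scoring of a candidate, so the normalization (one minus word-level edit distance divided by the longer length) and the length-weighted average over segments are assumptions made here for illustration. The candidate splittings are taken as input, standing in for the N-gram-based generation step, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical sequences, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_corpus_similarity(segment, corpus):
    """Similarity of a segment to its closest utterance in a non-empty corpus."""
    return max(similarity(segment, utterance) for utterance in corpus)


def score_split(segments, corpus):
    """Assumed score: length-weighted mean similarity over non-empty segments."""
    total = sum(len(segment) for segment in segments)
    weighted = sum(len(segment) * best_corpus_similarity(segment, corpus)
                   for segment in segments)
    return weighted / total


def select_split(candidates, corpus):
    """Pick the candidate splitting whose segments best resemble the corpus."""
    return max(candidates, key=lambda segments: score_split(segments, corpus))


# Toy data, invented for illustration only.
corpus = [["thank", "you"], ["where", "is", "the", "station"]]
tokens = "thank you where is the station".split()
candidates = [
    [tokens],                   # no split
    [tokens[:2], tokens[2:]],   # split after "you"
    [tokens[:3], tokens[3:]],   # split after "where"
]
print(select_split(candidates, corpus))
```

On this toy input the split after "you" is selected, because both resulting segments match corpus utterances exactly. A real system would need a much larger corpus and a faster nearest-utterance search than the linear scan used here.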