Finding translation pairs from English-Japanese untokenized aligned corpora

  • Authors:
  • Genichiro Kikui; Hirofumi Yamamoto

  • Affiliations:
  • ATR Spoken Language Translation Research Laboratories, Kyoto, Japan; ATR Spoken Language Translation Research Laboratories, Kyoto, Japan

  • Venue:
  • S2S '02 Proceedings of the ACL-02 workshop on Speech-to-speech translation: algorithms and systems - Volume 7
  • Year:
  • 2002

Abstract

This paper proposes a new algorithm for finding translation pairs in an English-Japanese sentence-aligned parallel corpus. Unlike previous methods, our algorithm does not presuppose a separate tokenizer for Japanese; instead, it finds translation pairs as a side effect of unsupervised tokenization of the Japanese sentences, guided by information from the English sentences. The algorithm is based on the observation that two Japanese sentences tend to share a word when their English mates (i.e., aligned sentences) contain the same word. We implemented this idea as unsupervised tokenization of Japanese with extended hidden Markov models (HMMs), in which the hidden n-gram probabilities (i.e., state transition probabilities) are affected by co-occurring words in the English part. In our experiment on finding noun-noun translation pairs, the algorithm achieved 76.3% accuracy, 0.4 points lower than the result obtained with supervised tokenization.
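The observation behind the algorithm can be illustrated with a small sketch. This is not the extended-HMM method described in the abstract; it is a simplified stand-in that only collects, for each English word, the longest character substring shared by pairs of untokenized Japanese sentences whose English mates contain that word. The function names (`longest_common_substring`, `candidate_pairs`) and the toy corpus are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Standard dynamic programming over character positions; on ties the
    substring that ends earliest in `a` is kept.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def candidate_pairs(corpus):
    """Collect candidate (English word, Japanese substring) counts.

    `corpus` is a list of (english_sentence, japanese_sentence) pairs.
    English is split on whitespace; Japanese is left untokenized.
    """
    # Group Japanese sentences by each English word in their mate.
    by_word = defaultdict(list)
    for en, ja in corpus:
        for word in set(en.lower().split()):
            by_word[word].append(ja)

    # For every pair of Japanese sentences sharing an English word,
    # count their longest common substring as a candidate translation.
    candidates = defaultdict(lambda: defaultdict(int))
    for word, ja_sentences in by_word.items():
        for s1, s2 in combinations(ja_sentences, 2):
            common = longest_common_substring(s1, s2)
            if common:
                candidates[word][common] += 1
    return candidates


# Toy corpus, invented for illustration only.
corpus = [
    ("the hotel is near", "ホテルは近いです"),
    ("i will stay at this hotel", "このホテルに泊まります"),
    ("the station is near", "駅は近いです"),
]

for word, subs in sorted(candidate_pairs(corpus).items()):
    print(word, dict(subs))
```

On this toy corpus, "hotel" is paired with ホテル, but "near" is paired with the over-long string は近いです, which also contains a particle and a copula. Noise of this kind is why a raw substring heuristic is insufficient, and the abstract's approach instead lets the English words influence the n-gram probabilities of a tokenization model.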