Substring-based machine translation

  • Authors:
  • Graham Neubig;Taro Watanabe;Shinsuke Mori;Tatsuya Kawahara

  • Affiliations:
  • Nara Institute of Science and Technology, Ikoma, Japan;National Institute of Information and Communications Technology, Kyoto, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan

  • Venue:
  • Machine Translation
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632---641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.