Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining

  • Authors: Kaoru Yamamoto; Taku Kudo; Yuta Tsuboi; Yuji Matsumoto

  • Affiliations: Genomic Sciences Center, Tsurumi-ku, Yokohama, Japan; Nara Institute of Science and Technology, Ikoma, Nara, Japan; Tokyo Research Laboratory, Yamato-shi, Kanagawa-ken, Japan; Nara Institute of Science and Technology, Ikoma, Nara, Japan

  • Venue: HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
  • Year: 2003

Abstract

We present an unsupervised method for extracting sequence-to-sequence correspondences from parallel corpora by sequential pattern mining. The method has two main characteristics. First, it systematically enumerates all possible translation pair candidates of rigid and gapped sequences without falling into combinatorial explosion. Second, it uses an efficient data structure and algorithm to calculate the frequencies in a contingency table for each translation pair candidate. We evaluate the method empirically on English-Japanese parallel corpora of 6 million words. The results indicate that it works well for multi-word translations, giving 56–84% accuracy at 19% token coverage and 11% type coverage.
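
To make the contingency table in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a rigid sequence is a contiguous run of tokens, that a gapped sequence is an in-order subsequence with arbitrary gaps, and that the table counts aligned sentence pairs. The function names and the toy corpus (romanized Japanese) are hypothetical, and the naive scan over the corpus stands in for the efficient data structure the abstract refers to, which is not described here.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]
SentencePair = Tuple[Sentence, Sentence]


def contains_rigid(sentence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if pattern occurs as a contiguous run of tokens (assumed meaning of 'rigid')."""
    n, m = len(sentence), len(pattern)
    return any(list(sentence[i:i + m]) == list(pattern) for i in range(n - m + 1))


def contains_gapped(sentence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if pattern occurs in order with arbitrary gaps (assumed meaning of 'gapped')."""
    it = iter(sentence)
    # Each membership test consumes the iterator up to the match,
    # so later tokens must appear after earlier ones.
    return all(tok in it for tok in pattern)


def contingency_table(
    corpus: Sequence[SentencePair],
    src_pattern: Sequence[str],
    tgt_pattern: Sequence[str],
    match: Callable[[Sequence[str], Sequence[str]], bool] = contains_gapped,
) -> Tuple[int, int, int, int]:
    """Count sentence pairs for one candidate translation pair.

    Returns (a, b, c, d):
      a = both patterns present, b = source only,
      c = target only,           d = neither.
    """
    a = b = c = d = 0
    for src, tgt in corpus:
        in_src = match(src, src_pattern)
        in_tgt = match(tgt, tgt_pattern)
        if in_src and in_tgt:
            a += 1
        elif in_src:
            b += 1
        elif in_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical toy corpus, for illustration only.
corpus: List[SentencePair] = [
    ("please take a look at this".split(), "kore wo mite kudasai".split()),
    ("take a quick look".split(), "chotto mite".split()),
    ("take the bus".split(), "basu ni notte".split()),
    ("thank you".split(), "arigatou".split()),
    ("look at me".split(), "watashi wo mite".split()),
]

gapped = contingency_table(corpus, ["take", "look"], ["mite"])
rigid = contingency_table(corpus, ["take", "a", "look"], ["mite"], match=contains_rigid)
print(gapped)  # (2, 0, 1, 2)
print(rigid)   # (1, 0, 2, 2)
```

In this toy data the gapped source pattern "take ... look" co-occurs with "mite" in two sentence pairs, while the rigid pattern "take a look" matches only one. This shows how allowing gaps changes the counts that any association score computed from the table would see.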