Finding translation pairs from English-Japanese untokenized aligned corpora

Authors:
Genichiro Kikui;Hirofumi Yamamoto
Affiliations:
ATR Spoken Language Translation Research Laboratories, Kyoto, Japan;ATR Spoken Language Translation Research Laboratories, Kyoto, Japan
Venue:
S2S '02 Proceedings of the ACL-02 workshop on Speech-to-speech translation: algorithms and systems - Volume 7
Year:
2002

Abstract

This paper proposes a new algorithm for finding translation pairs in a sentence-aligned English-Japanese parallel corpus. Unlike previous methods, our algorithm does not presuppose a separate tokenizer for Japanese; instead, it finds translation pairs as side effects of unsupervised tokenization of the Japanese sentences, guided by information from the English sentences. The algorithm is based on the observation that two Japanese sentences tend to share a word when their English mates (i.e., the sentences aligned with them) contain the same word. We implemented this idea as unsupervised tokenization of Japanese with extended hidden Markov models (HMMs), in which the hidden n-gram probabilities (i.e., the state transition probabilities) are affected by co-occurring words in the English part. Our experiment on finding noun-noun translation pairs achieved 76.3% accuracy, 0.4 percentage points lower than the result obtained with supervised tokenization.
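The observation underlying the algorithm can be illustrated with a small sketch. This is not the extended-HMM method described in the abstract: it is a naive stand-in that, for each English word, collects the Japanese mates of the sentences containing that word and reports the longest character substring they all share. The function names (`substrings`, `candidate_pairs`), the parameters `min_support` and `max_len`, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def substrings(text, max_len=4):
    """Return the set of character substrings of `text` up to `max_len` long."""
    return {
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def candidate_pairs(corpus, min_support=2, max_len=4):
    """Map each English word to the longest Japanese substring shared by
    every Japanese mate of the English sentences containing that word.

    Illustrative only: the paper uses extended HMMs, not substring counting.
    """
    # Group the untokenized Japanese sentences by the English words of their mates.
    mates_by_word = {}
    for english, japanese in corpus:
        for word in set(english.lower().split()):
            mates_by_word.setdefault(word, []).append(japanese)

    pairs = {}
    for word, mates in mates_by_word.items():
        if len(mates) < min_support:
            continue  # too little evidence for this English word
        # Count, for each substring, how many mates contain it.
        counts = Counter()
        for japanese in mates:
            counts.update(substrings(japanese, max_len))
        shared = [s for s, c in counts.items() if c == len(mates)]
        if shared:
            # Longest shared substring; ties are broken alphabetically.
            pairs[word] = sorted(shared, key=lambda s: (-len(s), s))[0]
    return pairs


# Hypothetical toy corpus of (English, untokenized Japanese) sentence pairs.
toy_corpus = [
    ("I want a ticket", "切符がほしい"),
    ("Where is the ticket office", "切符売り場はどこですか"),
    ("I want water", "水がほしい"),
]

pairs = candidate_pairs(toy_corpus)
print(pairs)
```

On this toy data, "ticket" is paired with "切符", while "want" is paired with "がほしい", which wrongly absorbs the particle "が". Errors of this kind suggest why a probabilistic tokenization model, rather than raw substring matching, is needed in practice.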