Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining

Authors:
Kaoru Yamamoto; Taku Kudo; Yuta Tsuboi; Yuji Matsumoto
Affiliations:
Genomic Sciences Center, Tsurumi-ku, Yokohama, Japan; Nara Institute of Science and Technology, Ikoma, Nara, Japan; Tokyo Research Laboratory, Yamato-shi, Kanagawa-ken, Japan; Nara Institute of Science and Technology, Ikoma, Nara, Japan
Venue:
HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Year:
2003

Abstract

We present an unsupervised method for extracting sequence-to-sequence correspondences from parallel corpora by sequential pattern mining. The method has two main characteristics. First, it systematically enumerates all possible translation pair candidates, for both rigid and gapped sequences, without falling into combinatorial explosion. Second, it uses an efficient data structure and algorithm to calculate the frequencies in the contingency table of each translation pair candidate. We evaluate the method empirically on English-Japanese parallel corpora of 6 million words. The results indicate that it works well for multi-word translations, giving 56-84% accuracy at 19% token coverage and 11% type coverage.
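
To make the notions of rigid sequences, gapped sequences, and a per-candidate contingency table concrete, here is a minimal Python sketch. It is an illustration only, not the method of the paper: it scans every sentence pair naively, whereas the abstract refers to an efficient data structure and algorithm that is not detailed here. The function names, the cell order (a, b, c, d), and the four-sentence toy corpus (with romanized Japanese) are all assumptions made up for this example.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[Sequence[str], Sequence[str]]


def occurs_rigid(sentence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` appears as a contiguous run of tokens."""
    n, m = len(sentence), len(pattern)
    return any(list(sentence[i:i + m]) == list(pattern) for i in range(n - m + 1))


def occurs_gapped(sentence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` appears in order, with any number of tokens in between."""
    it = iter(sentence)
    # `tok in it` advances the iterator, so later tokens must occur further right.
    return all(tok in it for tok in pattern)


def contingency_table(
    corpus: List[SentencePair],
    src_pattern: Sequence[str],
    tgt_pattern: Sequence[str],
    gapped: bool = True,
) -> Tuple[int, int, int, int]:
    """Count sentence pairs for one translation pair candidate.

    Returns (a, b, c, d), a cell order assumed for this sketch:
      a: both patterns occur      b: only the source pattern occurs
      c: only the target occurs   d: neither occurs
    """
    match = occurs_gapped if gapped else occurs_rigid
    a = b = c = d = 0
    for src, tgt in corpus:
        in_src, in_tgt = match(src, src_pattern), match(tgt, tgt_pattern)
        if in_src and in_tgt:
            a += 1
        elif in_src:
            b += 1
        elif in_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical toy corpus, invented for illustration.
corpus: List[SentencePair] = [
    ("please turn the light off".split(), "denki o keshite kudasai".split()),
    ("turn the radio off".split(), "rajio o keshite".split()),
    ("turn left here".split(), "koko de hidari ni magatte".split()),
    ("the light is on".split(), "denki ga tsuite iru".split()),
]

gapped_counts = contingency_table(corpus, ("turn", "off"), ("keshite",), gapped=True)
rigid_counts = contingency_table(corpus, ("turn", "off"), ("keshite",), gapped=False)
print(gapped_counts, rigid_counts)
```

In this toy corpus the gapped pattern "turn ... off" co-occurs with "keshite" in two sentence pairs, while the rigid pattern "turn off" matches none, which shows why gapped candidates can matter for multi-word translations. An association score computed from the four cells would then rank the candidates; which score to use is left open here.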