The word is mightier than the count: accumulating translation resources from parsed parallel corpora

  • Authors:
  • Stephen Nightingale; Hideki Tanaka

  • Affiliations:
  • ATR, Spoken Language Translation Research Laboratory, Kyoto, Japan (both authors)

  • Venue:
  • CICLing'03: Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2003

Abstract

Large, high-quality, sentence-aligned parallel corpora are hard to come by, and this makes the Statistical Machine Translation enterprise more difficult. Even noisy corpora, however, can provide useful translation resources that are not otherwise available. Many investigations have used statistical methods to find word correspondences. Such methods often suffer from overgeneration, so to correct this we filter the relevant translation candidates with a lexical post-process. This dictionary lookup is in fact so effective that it calls into question the value of the statistical methods. Using a dictionary lookup against all combinations of phrase pairs as a baseline, we compare three statistical methods and report the results. The three methods are (1) Mutual Information; (2) Expectation Maximization over word co-occurrence frequencies; and (3) Expectation Maximization over word alignments within each sentence. We also apply the dictionary lookup as a post-process to tackle overgeneration.
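
As a rough illustration of the first scoring method mentioned above, the sketch below computes pointwise mutual information scores for candidate word pairs from sentence-aligned co-occurrence counts. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the toy corpus, the presence-based counting, and the ranking step are all assumptions introduced here for illustration.

```python
# Minimal sketch (not the paper's code): pointwise mutual information (PMI)
# scores for word-pair translation candidates from a sentence-aligned corpus.
# The toy corpus and the raw-count probability estimates are assumptions.
import math
from collections import Counter
from itertools import product

# Each pair is (source-sentence tokens, target-sentence tokens).
aligned_pairs = [
    (["the", "white", "house"], ["la", "maison", "blanche"]),
    (["the", "house"], ["la", "maison"]),
    (["white", "wine"], ["vin", "blanc"]),
]

src_counts = Counter()
tgt_counts = Counter()
joint_counts = Counter()

for src, tgt in aligned_pairs:
    for s in set(src):
        src_counts[s] += 1
    for t in set(tgt):
        tgt_counts[t] += 1
    # Count a co-occurrence when s and t appear in the same aligned pair.
    for s, t in product(set(src), set(tgt)):
        joint_counts[(s, t)] += 1

n = len(aligned_pairs)

def pmi(s, t):
    """PMI of source word s and target word t over aligned sentence pairs."""
    p_s = src_counts[s] / n
    p_t = tgt_counts[t] / n
    p_st = joint_counts[(s, t)] / n
    return math.log2(p_st / (p_s * p_t)) if p_st > 0 else float("-inf")

# Rank candidate translations for one source word.
candidates = sorted(
    ((t, pmi("white", t)) for t in tgt_counts),
    key=lambda x: x[1],
    reverse=True,
)
print(candidates[:3])
```

A dictionary-based post-process of the kind the abstract describes would then discard ranked candidates whose pair is not sanctioned by a bilingual lexicon, which is how overgenerated correspondences are filtered out.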