A beam-search extraction algorithm for comparable data

Authors:
Christoph Tillmann
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, N.Y.
Venue:
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
Year:
2009

Citing 8
Cited 2

The Web as a parallel corpus

Computational Linguistics - Special issue on web as corpus
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
A DP based search using monotone alignments in statistical translation

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Reliable measures for aligning Japanese-English news articles and sentences

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics
A block bigram prediction model for statistical machine translation

ACM Transactions on Speech and Language Processing (TSLP)
Language and translation model adaptation using comparable corpora

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A simple sentence-level extraction algorithm for comparable data

NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Extracting parallel sentences from comparable corpora using document level alignment

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Language independent identification of parallel sentences using Wikipedia

Proceedings of the 20th international conference companion on World wide web

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper extends previous work on extracting parallel sentence pairs from comparable data (Munteanu and Marcu, 2005). For a given source sentence S, a maximum entropy (ME) classifier is applied to a large set of candidate target translations. A beam-search algorithm is used to abandon target sentences as non-parallel early on during classification if they fall outside the beam. This way, our novel algorithm avoids any document-level prefiltering step. The algorithm increases the number of extracted parallel sentence pairs significantly, which leads to a BLEU improvement of about 1% on our Spanish-English data.