Partitioning parallel documents using binary segmentation

Authors:
Jia Xu;Richard Zens;Hermann Ney
Affiliations:
RWTH Aachen University, Aachen, Germany;RWTH Aachen University, Aachen, Germany;RWTH Aachen University, Aachen, Germany
Venue:
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
Year:
2006

Citing 7
Cited 1

A systematic comparison of various statistical alignment models

Computational Linguistics
A program for aligning sentences in bilingual corpora

Computational Linguistics - Special issue on using large corpora: I
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
Stochastic inversion transduction grammars and bilingual parsing of parallel corpora

Computational Linguistics
BLEU: a method for automatic evaluation of machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Statistical translation alignment with compositionality constraints

HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

HLT '02 Proceedings of the second international conference on Human Language Technology Research

Using normalized alignment scores to detect incorrectly aligned segments

Proceedings of the 2nd international workshop on Patent information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, plenty of the bilingual language resources available on web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents. The common method is to break the documents into segments using predefined anchor words, then these segments are aligned. This approach is not error free, incorrect alignments may decrease the translation quality. We present an alternative approach to extract the parallel sentences by partitioning a bilingual document into two pairs. This process is performed recursively until all the sub-pairs are short enough. In experiments on the Chinese-English FBIS data, our method was capable of producing translation results comparable to those of a state-of-the-art sentence aligner. Using a combination of the two approaches leads to better translation performance.