Using collocation segmentation to augment the phrase table

  • Authors:
  • Carlos A. Henríquez Q.;Marta R. Costa-jussà;Vidas Daudaravicius;Rafael E. Banchs;José B. Mariño

  • Affiliations:
  • Universitat Politècnica de Catalunya, Barcelona, Spain;Barcelona Media Innovation Center, Barcelona, Spain;Vytautas Magnus University, Kaunas, Lithuania;Barcelona Media Innovation Center, Barcelona, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain

  • Venue:
  • WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
  • Year:
  • 2010

Abstract

This paper describes the 2010 phrase-based statistical machine translation (SMT) system developed at the TALP Research Center of the UPC in cooperation with BMIC and VMU. In phrase-based SMT, the phrase table is the main tool in translation. It is created by extracting phrases from an aligned parallel corpus and then computing translation model scores with them. Performing a collocation segmentation over the source and target corpora before the alignment causes different and larger phrases to be extracted from the same original documents. We performed this segmentation and used the union of the resulting phrase set with the phrase set extracted from the non-segmented corpus to compute the phrase table. We present the configurations considered and report results obtained with internal and official test sets.
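
The phrase-set union described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that collocation segments are marked by joining words with an underscore, that the two extracted phrase lists are simply concatenated before scoring, and that the translation score is a plain relative frequency. The abstract specifies none of these details, and the names `unsegment` and `build_phrase_table` are hypothetical.

```python
from collections import Counter, defaultdict


def unsegment(phrase, joiner="_"):
    """Turn collocation tokens such as 'la_casa' back into plain words.

    The underscore joiner is an assumption made for this sketch.
    """
    return phrase.replace(joiner, " ")


def build_phrase_table(baseline_pairs, segmented_pairs):
    """Combine two lists of (source, target) phrase pairs and score them.

    Returns {source: {target: p(target | source)}}, where the probability
    is the relative frequency of the pair among all pairs sharing that source.
    """
    # Count pairs from the non-segmented corpus.
    counts = Counter(baseline_pairs)
    # Add pairs from the segmented corpus after restoring plain words.
    counts.update((unsegment(s), unsegment(t)) for s, t in segmented_pairs)

    # Total count per source phrase, used as the normalizer.
    source_totals = Counter()
    for (src, _tgt), n in counts.items():
        source_totals[src] += n

    table = defaultdict(dict)
    for (src, tgt), n in counts.items():
        table[src][tgt] = n / source_totals[src]
    return dict(table)


# Toy phrase pairs, invented for illustration only.
baseline = [("la casa", "the house"), ("casa", "house"), ("casa", "home")]
segmented = [("la_casa", "the_house"), ("la_casa blanca", "the white_house")]
table = build_phrase_table(baseline, segmented)
```

In this toy input, the segmented corpus contributes a longer pair ("la casa blanca", "the white house") that is absent from the baseline list, which mirrors the abstract's point that segmentation before alignment yields different and larger phrases.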