Extraction of multi-word expressions from small parallel corpora

  • Authors:
  • Yulia Tsvetkov; Shuly Wintner

  • Affiliations:
  • Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA, e-mail: yulia.tsvetkov@gmail.com; Department of Computer Science, University of Haifa, Haifa, Israel, e-mail: shuly@cs.haifa.ac.il

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2011

Abstract

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
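
The candidate-proposal step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function name `propose_candidates`, the representation of an alignment as (source index, target index) pairs, the definition of an anchor as a source token whose single link goes to a target token with no other links, and the `min_len` threshold. The ranking and filtering step that uses a monolingual corpus is not shown.

```python
from collections import Counter


def propose_candidates(source_tokens, alignment, min_len=2):
    """Return maximal runs of source tokens that lack a 1:1 alignment.

    Hypothetical helper, not the authors' implementation.
    `alignment` is an iterable of (source_index, target_index) links.
    """
    links = set(alignment)
    src_degree = Counter(i for i, _ in links)
    tgt_degree = Counter(j for _, j in links)
    # Assumed anchor definition: the source token takes part in exactly
    # one link, and the linked target token takes part in no other link.
    anchors = {i for i, j in links
               if src_degree[i] == 1 and tgt_degree[j] == 1}

    candidates, run = [], []
    for i, token in enumerate(source_tokens):
        if i in anchors:
            # An anchor closes the current run of misaligned tokens.
            if len(run) >= min_len:
                candidates.append(tuple(run))
            run = []
        else:
            run.append(token)
    if len(run) >= min_len:
        candidates.append(tuple(run))
    return candidates


# Invented example: the three middle source tokens all link to one
# target token (index 1), so none of them is 1:1 aligned.
source = ["he", "kicked", "the", "bucket", "yesterday"]
alignment = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]
print(propose_candidates(source, alignment))
```

In this invented example the anchors "he" and "yesterday" delimit the span, and the run between them is proposed as a single candidate. Unaligned source tokens are treated the same way as many-to-one links, which is one possible reading of "misalignment" and not something the abstract confirms.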