Parallel corpora segmentation using anchor words

  • Authors: Francisco Nevado, Francisco Casacuberta, Enrique Vidal

  • Affiliations: Universidad Politécnica de Valencia (all three authors)

  • Venue: EAMT '03 Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools: Resources and Tools for Building MT
  • Year: 2003

Abstract

A new technique for monotone segmentation of parallel corpora is introduced. The segmentation relies on a manually defined set of anchor words, and the parallel segments are computed with a dynamic programming algorithm. To assess the technique, finite-state transducers are inferred from both the unsegmented and the segmented corpora. Experiments were carried out on Spanish-English and Italian-English translation tasks. In these experiments, the segmented corpora gave better results than the unsegmented corpora.
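
The abstract does not spell out the dynamic programming recurrence, so the following is only a minimal sketch of one way anchor-based monotone segmentation could work, not the authors' algorithm. It assumes that anchors are given as a hypothetical source-to-target word mapping (`anchors`), that the best segmentation is the one matching the most anchor occurrences in left-to-right order (a longest-common-subsequence style criterion), and that each segment ends right after a matched anchor. The function name `anchor_segments` and the example sentence pair are illustrative only.

```python
from typing import Dict, List, Tuple

Segment = Tuple[List[str], List[str]]


def anchor_segments(src: List[str], tgt: List[str],
                    anchors: Dict[str, str]) -> List[Segment]:
    """Split a sentence pair into monotone parallel segments.

    Assumption: `anchors` maps a source anchor word to its target anchor
    word, and a cut is placed right after every matched anchor pair.
    """
    # Positions where an anchor word occurs on each side.
    s_pos = [i for i, w in enumerate(src) if w in anchors]
    tgt_anchor_words = set(anchors.values())
    t_pos = [j for j, w in enumerate(tgt) if w in tgt_anchor_words]
    n, m = len(s_pos), len(t_pos)

    def matches(a: int, b: int) -> bool:
        return anchors[src[s_pos[a]]] == tgt[t_pos[b]]

    # best[a][b]: most anchor pairs that can be matched monotonically
    # using source anchors s_pos[a:] and target anchors t_pos[b:].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(n - 1, -1, -1):
        for b in range(m - 1, -1, -1):
            best[a][b] = max(best[a + 1][b], best[a][b + 1])
            if matches(a, b):
                best[a][b] = max(best[a][b], 1 + best[a + 1][b + 1])

    # Trace back through the table to recover the matched anchor pairs.
    cuts: List[Tuple[int, int]] = []
    a = b = 0
    while a < n and b < m:
        if matches(a, b) and best[a][b] == 1 + best[a + 1][b + 1]:
            cuts.append((s_pos[a], t_pos[b]))
            a, b = a + 1, b + 1
        elif best[a + 1][b] >= best[a][b + 1]:
            a += 1
        else:
            b += 1

    # Emit one segment per matched anchor, plus any trailing remainder.
    segments: List[Segment] = []
    i0 = j0 = 0
    for i, j in cuts:
        segments.append((src[i0:i + 1], tgt[j0:j + 1]))
        i0, j0 = i + 1, j + 1
    if i0 < len(src) or j0 < len(tgt):
        segments.append((src[i0:], tgt[j0:]))
    return segments


if __name__ == "__main__":
    # Illustrative sentence pair; punctuation marks serve as anchors here.
    source = "quiero una habitación , por favor .".split()
    target = "I want a room , please .".split()
    for s, t in anchor_segments(source, target, {",": ",", ".": "."}):
        print(" ".join(s), "|||", " ".join(t))
```

Under these assumptions, an anchor that has no counterpart on the other side is simply skipped, so its words stay inside the enclosing segment. The resulting shorter segment pairs could then be used as training material in place of the full sentence pairs.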