Text alignment in the real world: improving alignments of noisy translations using common lexical features, string matching strategies and n-gram comparisons

  • Authors:
  • Mark W. Davis; Ted E. Dunning; William C. Ogden

  • Affiliations:
  • New Mexico State University, Las Cruces, New Mexico; New Mexico State University, Las Cruces, New Mexico; New Mexico State University, Las Cruces, New Mexico

  • Venue:
  • EACL '95: Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics
  • Year:
  • 1995

Abstract

Alignment methods based on byte-length comparisons of alignment blocks have been remarkably successful for aligning good translations from legislative transcriptions. For noisy translations, in which the parallel texts of a document have significant structural differences, byte-alignment methods often perform poorly. The Pan American Health Organization (PAHO) corpus is a series of articles that were first translated by machine and then improved by professional translators. Many of the Spanish PAHO texts do not share formatting conventions with the corresponding English documents, refer to tables in stylistically different ways, and contain extraneous information. A method that retains the dynamic programming framework but uses a decision criterion derived from a combination of byte-length ratio measures, hard matching of numbers, string comparisons, and n-gram co-occurrence matching substantially improves the performance of the alignment process.
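
The general idea of a dynamic programming aligner driven by a combined decision criterion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pair_cost`, `align`), the equal weights, the skip penalty, and the restriction to one-to-one matches and skips are all assumptions made for illustration. In particular, a plain character-trigram Jaccard overlap within each candidate pair stands in for the n-gram co-occurrence matching named in the abstract, and `difflib.SequenceMatcher` stands in for the string comparisons.

```python
import re
from difflib import SequenceMatcher


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pair_cost(a, b, weights=(1.0, 1.0, 1.0, 1.0)):
    """Cost in [0, 1] of aligning blocks a and b (lower is better).

    The four terms and their equal weights are illustrative assumptions.
    """
    # 1. Byte-length ratio: 0 when both blocks have the same byte length.
    la, lb = len(a.encode("utf-8")), len(b.encode("utf-8"))
    length_term = 0.0 if la == lb else 1.0 - min(la, lb) / max(la, lb)

    # 2. Hard matching of numbers: digit strings should appear on both sides.
    nums_a, nums_b = set(re.findall(r"\d+", a)), set(re.findall(r"\d+", b))
    union = nums_a | nums_b
    number_term = 1.0 - len(nums_a & nums_b) / len(union) if union else 0.0

    # 3. String comparison: rewards cognates and untranslated names.
    string_term = 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # 4. Character n-gram overlap (a simplified stand-in, see lead-in).
    ga, gb = char_ngrams(a), char_ngrams(b)
    gunion = ga | gb
    ngram_term = 1.0 - len(ga & gb) / len(gunion) if gunion else 0.0

    terms = (length_term, number_term, string_term, ngram_term)
    return sum(w * t for w, t in zip(weights, terms)) / sum(weights)


def align(src, tgt, skip_penalty=0.6):
    """Return a list of (src_index, tgt_index) pairs of aligned blocks.

    Classic dynamic programming over a grid: each step either pairs one
    source block with one target block or skips a block on one side.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:  # pair src[i-1] with tgt[j-1]
                step = pair_cost(src[i - 1], tgt[j - 1])
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:  # skip a source block (extraneous material)
                candidates.append((cost[i - 1][j] + skip_penalty, (i - 1, j)))
            if j > 0:  # skip a target block
                candidates.append((cost[i][j - 1] + skip_penalty, (i, j - 1)))
            if candidates:
                cost[i][j], back[i][j] = min(candidates)

    # Trace back from the end, keeping only the diagonal (paired) steps.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((pi, pj))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = [
        "Table 3 shows 1995 data.",
        "Editor's note: reprinted with permission.",
        "Malaria cases fell by 12 percent.",
    ]
    spanish = [
        "El cuadro 3 muestra datos de 1995.",
        "Los casos de malaria bajaron un 12 por ciento.",
    ]
    print(align(english, spanish))
```

In this toy input the second English block has no Spanish counterpart, which mimics the extraneous information the abstract describes. A pure byte-length criterion has little to separate it from its neighbours, whereas the number and string terms make skipping it the cheapest path.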