Text alignment in the real world: improving alignments of noisy translations using common lexical features, string matching strategies and n-gram comparisons

Authors: Mark W. Davis, Ted E. Dunning, William C. Ogden
Affiliation: New Mexico State University, Las Cruces, New Mexico (all authors)
Venue: EACL '95 Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics
Year: 1995

Abstract

Alignment methods based on byte-length comparisons of alignment blocks have been remarkably successful for aligning good translations from legislative transcriptions. For noisy translations, in which the parallel texts of a document have significant structural differences, byte-alignment methods often perform poorly. The Pan American Health Organization (PAHO) corpus is a series of articles that were first translated by machine methods and then improved by professional translators. Many of the Spanish PAHO texts do not share formatting conventions with the corresponding English documents, refer to tables in stylistically different ways, and contain extraneous information. A method that retains the dynamic programming framework but uses a decision criterion derived from a combination of byte-length ratio measures, hard matching of numbers, string comparisons, and n-gram co-occurrence matching substantially improves the performance of the alignment process.
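
The abstract names the ingredients of the decision criterion but gives no formula, so the following Python sketch is only an illustration of how such a combined criterion could sit inside a dynamic programming aligner. Everything specific in it is an assumption rather than the authors' method: the equal weighting of the three terms, the use of character trigrams with a Dice coefficient, the hypothetical names `match_cost` and `align`, and the `skip_cost` and `merge_penalty` values.

```python
import re

def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def numbers(text):
    """Set of number strings (digits with optional . or , groups)."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))

def dice(a, b, if_empty):
    """Dice coefficient of two sets; if_empty when both are empty."""
    if not a and not b:
        return if_empty
    return 2 * len(a & b) / (len(a) + len(b))

def match_cost(src, tgt):
    """Cost of pairing two blocks (lower is better).

    Assumed, illustrative combination: byte-length mismatch, number
    mismatch, and character-trigram mismatch, each in [0, 1].
    """
    ls, lt = len(src.encode("utf-8")), len(tgt.encode("utf-8"))
    length_term = abs(ls - lt) / max(ls, lt, 1)
    # No numbers on either side is treated as neutral evidence (0.5).
    number_term = 1.0 - dice(numbers(src), numbers(tgt), if_empty=0.5)
    ngram_term = 1.0 - dice(char_ngrams(src), char_ngrams(tgt), if_empty=0.0)
    return length_term + number_term + ngram_term

def align(src_blocks, tgt_blocks, skip_cost=2.0, merge_penalty=0.1):
    """Minimum-cost monotone alignment via dynamic programming.

    Allowed steps: 1-1 match, 1-0 and 0-1 skips, 2-1 and 1-2 merges.
    Returns a list of (source indices, target indices) pairs.
    """
    n, m = len(src_blocks), len(tgt_blocks)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_cost  # block left unaligned
                else:
                    step = match_cost(" ".join(src_blocks[i:ni]),
                                      " ".join(tgt_blocks[j:nj]))
                    if di + dj > 2:
                        step += merge_penalty  # discourage needless merges
                total = best[i][j] + step
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the final cell to recover the path.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs

if __name__ == "__main__":
    english = ["Table 3 shows 1250 cases reported in 1992.",
               "Vaccination coverage reached 85 percent."]
    spanish = ["El cuadro 3 muestra 1250 casos notificados en 1992.",
               "La cobertura de vacunación alcanzó el 85 por ciento."]
    print(align(english, spanish))
```

In this sketch the number term does much of the work for the invented example sentences, since matching figures survive translation while most character trigrams do not. How well skips handle extraneous blocks depends heavily on `skip_cost` relative to typical match costs, so that value would need tuning on real data.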