Extending the BLEU MT evaluation method with frequency weightings

Authors:
Bogdan Babych;Anthony Hartley
Affiliations:
University of Leeds, Leeds, UK;University of Leeds, Leeds, UK
Venue:
ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Year:
2004

Citing 2

Computer Evaluation of Indexing and Text Processing
Journal of the ACM (JACM)

BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Cited 8

Automatically evaluating answers to definition questions
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing

Paraphrasing for automatic evaluation
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics

Using comparable corpora to solve problems difficult for human translators
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions

Word error rates: decomposition over POS classes and applications for error analysis
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation

Morpho-syntactic information for automatic error analysis of statistical machine translation output
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation

Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1

ATEC: automatic evaluation of machine translation via word choice and word order
Machine Translation

The NIST 2008 Metrics for machine translation challenge--overview, methodology, metrics, and results
Machine Translation

Abstract

We present the results of an experiment on extending BLEU, the automatic method of Machine Translation evaluation, with statistical weights for lexical items, such as tf.idf scores. We show that this extension gives additional information about evaluated texts; in particular, it allows us to measure translation Adequacy, which, for statistical MT systems, is often overestimated by the baseline BLEU method. The proposed model uses a single human reference translation, which makes the method more usable for practical purposes. The model suggests a linguistic interpretation which relates frequency weights and human intuition about translation Adequacy and Fluency.
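
The idea of weighting matched lexical items by their statistical salience can be illustrated with a minimal sketch. This is not the authors' formula, which the abstract does not give; it only shows one plausible way frequency weights could enter a unigram precision/recall computation against a single reference. The function names `make_weight` and `weighted_scores`, the smoothed idf formula, and the toy corpus are all assumptions made for illustration. Summing an idf weight over token occurrences amounts to weighting each word type by term frequency times idf.

```python
import math
from collections import Counter


def make_weight(corpus):
    """Build a smoothed idf weight function from a list of token lists.

    Hypothetical helper: the smoothing (+1 terms) is an assumption chosen
    so that every word, including unseen ones, gets a positive weight.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each word type.
    df = Counter(word for doc in corpus for word in set(doc))

    def weight(word):
        return math.log((n_docs + 1) / (df[word] + 1)) + 1.0

    return weight


def weighted_scores(candidate, reference, weight):
    """Weighted unigram precision and recall against one reference.

    Matches are clipped by the reference count, as in BLEU's modified
    precision; each matched token contributes its weight instead of 1.
    """
    cand = Counter(candidate)
    ref = Counter(reference)
    matched = sum(min(c, ref[w]) * weight(w) for w, c in cand.items())
    cand_total = sum(c * weight(w) for w, c in cand.items())
    ref_total = sum(c * weight(w) for w, c in ref.items())
    precision = matched / cand_total if cand_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall


# Toy data, invented for this sketch.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "treaty", "was", "signed"],
]
reference = ["the", "treaty", "was", "signed"]
weight = make_weight(corpus)

# Dropping a rare content word costs more recall than dropping "the".
print(weighted_scores(["the", "was", "signed"], reference, weight))
print(weighted_scores(["treaty", "was", "signed"], reference, weight))
```

Under these assumptions, a candidate that omits a rare content word loses more weighted recall than one that omits a frequent function word, which is the kind of Adequacy-oriented signal the abstract describes.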