A systematic comparison of various statistical alignment models
Computational Linguistics
HMM-based word alignment in statistical translation
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
Minimum error rate training in statistical machine translation
ACL '03 Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Moses: open source toolkit for statistical machine translation
ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
Decomposability of translation metrics for improved evaluation and efficient algorithms
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Further meta-evaluation of machine translation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
Robust machine translation evaluation with entailment features
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
The Meteor metric for automatic evaluation of machine translation
Machine Translation
The best lexical metric for phrase-based statistical MT system optimization
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
METEOR-NEXT and the METEOR paraphrase tables: improved evaluation support for five target languages
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
The DCU dependency-based metric in WMT-MetricsMATR 2010
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
TESLA: translation evaluation of sentences with linear-programming-based analysis
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Automatic evaluation of translation quality for distant language pairs
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Findings of the 2011 Workshop on Statistical Machine Translation
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
AMBER: a modified BLEU, enhanced ranking metric
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Better evaluation metrics lead to better machine translation
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Improving AMBER, an MT evaluation metric
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
The trouble with SMT consistency
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric that combines precision, recall, and an ordering measure, and that is designed primarily for tuning MT systems. PORT does not require external resources and is quick to compute. It correlates better with human judgment than BLEU does. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% preferring the BLEU-tuned output, with 22.0% ties).
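The abstract names PORT's three ingredients: precision, recall, and an ordering measure. As a rough illustration only, the Python sketch below combines clipped unigram precision and recall with a Kendall-tau-style ordering term. The function names, the first-occurrence word matching, the ordering term, and the alpha combination weight are all assumptions made for this sketch; they do not reproduce the published PORT formula.

```python
# Illustrative PORT-style sentence score: precision, recall, and word order.
# This is a hypothetical sketch, NOT the published PORT definition.
from collections import Counter
from itertools import combinations


def precision_recall(hyp, ref):
    """Clipped unigram precision and recall of hyp against ref."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    overlap = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return overlap / max(len(hyp), 1), overlap / max(len(ref), 1)


def ordering_score(hyp, ref):
    """Kendall-tau-style ordering score: fraction of concordant pairs among
    matched word positions. Each hypothesis word is mapped to the first
    occurrence of that word in the reference (a simplification; repeated
    words share a position and their pairs count as discordant)."""
    ref_pos = {w: i for i, w in reversed(list(enumerate(ref)))}
    matched = [ref_pos[w] for w in hyp if w in ref_pos]
    if len(matched) < 2:
        return 1.0
    pairs = list(combinations(matched, 2))
    concordant = sum(1 for a, b in pairs if a < b)
    return concordant / len(pairs)


def port_like_score(hyp, ref, alpha=0.5):
    """Harmonic mean of precision and recall, scaled by the ordering term
    raised to a tunable weight alpha (a hypothetical combination)."""
    p, r = precision_recall(hyp, ref)
    if p + r == 0:
        return 0.0
    qmean = 2 * p * r / (p + r)
    return qmean * ordering_score(hyp, ref) ** alpha


# Same bag of words as the reference, but reordered: precision and recall
# are perfect, so only the ordering term lowers the score.
hyp = "the cat sat on mat the".split()
ref = "the cat sat on the mat".split()
print(round(port_like_score(hyp, ref), 3))  # 0.816
```

Because the ordering term is computed separately from the n-gram overlap, a weight like alpha can be tuned to trade fluency-style order errors against adequacy-style content errors, which is the kind of knob the abstract's emphasis on tuning suggests.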