ORANGE: a method for evaluating automatic evaluation metrics for machine translation

Authors:
Chin-Yew Lin;Franz Josef Och
Affiliations:
University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

Comparisons of automatic evaluation metrics for machine translation are usually conducted at the corpus level using correlation statistics, such as Pearson's product-moment correlation coefficient or Spearman's rank-order correlation coefficient, between human scores and automatic scores. However, such comparisons rely on human judgments of translation qualities such as adequacy and fluency. Unfortunately, these judgments are often inconsistent and very expensive to acquire. In this paper, we introduce a new evaluation method, ORANGE, for evaluating automatic machine translation evaluation metrics automatically, with no human involvement beyond a set of reference translations. We also show the results of comparing several existing automatic metrics and three new automatic metrics using ORANGE.
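
As an illustration of the conventional corpus-level comparison that the abstract describes (not of ORANGE itself), the following is a minimal sketch of computing Pearson's and Spearman's coefficients between human scores and automatic metric scores. The function names and the score values are hypothetical and chosen only for the example; they do not come from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five MT systems (illustrative values only).
human_scores = [3.2, 2.8, 3.9, 2.1, 3.5]        # e.g. mean adequacy per system
metric_scores = [0.31, 0.25, 0.42, 0.18, 0.30]  # e.g. an automatic metric per system

print("Pearson: ", pearson(human_scores, metric_scores))
print("Spearman:", spearman(human_scores, metric_scores))
```

A metric whose coefficients are closer to 1 tracks the human scores more closely under this scheme. The human-score column is the input that is costly and often inconsistent to collect, which is the dependency ORANGE is designed to avoid.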