Sentence level machine translation evaluation as a ranking problem: one step aside from BLEU

Authors:
Yang Ye;Ming Zhou;Chin-Yew Lin
Affiliations:
University of Michigan;Microsoft Research Asia;Microsoft Research Asia
Venue:
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Year:
2007

Abstract

The paper proposes formulating MT evaluation as a ranking problem, as is often done in human assessment. Under this ranking formulation, the study also investigates the relative utility of several features. The results show greater correlation with human assessment at the sentence level, even when an n-gram match score is used as a baseline feature. The feature contributing most to the rank-order correlation between automatic ranking and human assessment was the dependency structure relation, rather than the BLEU score or the reference language model feature.
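
To make "rank-order correlation between automatic ranking and human assessment" concrete, the following is a minimal sketch of one common rank-order statistic, Spearman's rho, computed over sentence-level scores. The abstract does not say which correlation statistic the paper used, so the choice of Spearman's rho is an assumption, and the function names and sample scores are hypothetical, for illustration only.

```python
import math


def average_ranks(scores):
    """Rank values from 1 (lowest) upward; tied values share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(metric_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_scores)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: automatic scores and human judgments for four
# candidate translations of the same source sentence.
automatic = [0.31, 0.52, 0.18, 0.47]
human = [3, 4, 1, 2]
rho = spearman_rho(automatic, human)
```

A value of `rho` near 1 means the automatic ordering of the translations closely matches the human ordering, and a value near -1 means the two orderings are reversed. Only the relative order of the scores matters, not their magnitudes, which is what makes a rank-based statistic a natural fit for evaluating a ranking formulation.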