BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Evaluating machine translation with LFG dependencies
Machine Translation
(Meta-) evaluation of machine translation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Linguistic features for automatic evaluation of heterogenous MT systems
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Further meta-evaluation of machine translation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
A smorgasbord of features for automatic MT evaluation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Manual and automatic evaluation of machine translation between European languages
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
SMT versus AI redux: how semantic frames evaluate MT more accurately
IJCAI'11 Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three
Structured vs. flat semantic role representations for machine translation evaluation
SSST-5 Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
SAGAN: an approach to semantic textual similarity based on textual entailment
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
PORT: a precision-order-recall MT evaluation metric for tuning
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Probabilistic finite state machines for regression-based MT evaluation
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Towards a predicate-argument evaluation for MT
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Semantic textual similarity for MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Fully automatic semantic MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Fuzzy matching for N-gram-based MT evaluation
CLSW'12 Proceedings of the 13th Chinese conference on Chinese Lexical Semantics
Statistical machine translation enhancements through linguistic levels: A survey
ACM Computing Surveys (CSUR)
Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model
Computational Linguistics
We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER does, but at a much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widely used n-gram-based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. More accurate, non-automatic adequacy-oriented MT evaluation metrics like HTER, however, are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when untrained monolingual readers annotate semantic roles in MT output, the non-automatic version of the metric, HMEANT, achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the metric, and show that even this semi-automated version achieves a 0.34 correlation coefficient with human adequacy judgments, about 80% as closely correlated as HTER, at an even lower labor cost. These results show that the proposed metric correlates significantly better with human adequacy judgments than current widely used automatic evaluation metrics, while being far more cost-effective than HTER.
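To make the "matching semantic role fillers" idea concrete, the Python sketch below scores a toy MT/reference sentence pair as an F-score over shared role fillers. This is a rough illustration only, not the paper's actual MEANT definition: the names (Frame, filler_similarity, meant_like_score), the Jaccard token-overlap similarity, and the unweighted F-score are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """A shallow semantic frame: a predicate plus a role -> filler-token mapping."""
    predicate: str
    roles: dict  # e.g. {"ARG0": ["the", "police"], "ARG1": ["the", "crowd"]}

def filler_similarity(mt_tokens, ref_tokens):
    """Token-overlap (Jaccard) similarity between two role fillers.
    A stand-in for the metric's actual filler-matching step."""
    mt, ref = set(mt_tokens), set(ref_tokens)
    return len(mt & ref) / len(mt | ref) if mt | ref else 0.0

def meant_like_score(mt_frames, ref_frames):
    """Unweighted F-score over matched semantic role fillers.

    For each reference frame, find the MT frame with the same predicate
    and credit each shared role by its filler similarity. Precision is
    normalized by roles in the MT output, recall by roles in the reference.
    """
    matched = 0.0
    mt_roles = sum(len(f.roles) for f in mt_frames)
    ref_roles = sum(len(f.roles) for f in ref_frames)
    mt_by_pred = {f.predicate: f for f in mt_frames}
    for ref in ref_frames:
        mt = mt_by_pred.get(ref.predicate)
        if mt is None:
            continue  # predicate untranslated or mistranslated: no credit
        for role, ref_filler in ref.roles.items():
            if role in mt.roles:
                matched += filler_similarity(mt.roles[role], ref_filler)
    precision = matched / mt_roles if mt_roles else 0.0
    recall = matched / ref_roles if ref_roles else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example: one frame each in the MT output and the reference.
mt = [Frame("arrest", {"ARG0": ["police"], "ARG1": ["the", "protesters"]})]
ref = [Frame("arrest", {"ARG0": ["the", "police"], "ARG1": ["the", "demonstrators"]})]
print(meant_like_score(mt, ref))  # ~0.42: partial credit from overlapping fillers
```

In the metric described in the abstract, the frames and role fillers come either from untrained monolingual annotators (HMEANT) or from automatic shallow semantic parsing (the semi-automated version); the sketch above only conveys the overall precision/recall structure of matching role fillers between MT output and reference.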