BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Evaluating machine translation with LFG dependencies
Machine Translation
(Meta-) evaluation of machine translation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Linguistic features for automatic evaluation of heterogenous MT systems
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Further meta-evaluation of machine translation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
A smorgasbord of features for automatic MT evaluation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Manual and automatic evaluation of machine translation between European languages
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
SMT versus AI redux: how semantic frames evaluate MT more accurately
IJCAI'11 Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three
Structured vs. flat semantic role representations for machine translation evaluation
SSST-5 Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
SAGAN: an approach to semantic textual similarity based on textual entailment
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
PORT: a precision-order-recall MT evaluation metric for tuning
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Probabilistic finite state machines for regression-based MT evaluation
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Towards a predicate-argument evaluation for MT
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Semantic textual similarity for MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Fully automatic semantic MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Fuzzy matching for N-gram-based MT evaluation
CLSW'12 Proceedings of the 13th Chinese conference on Chinese Lexical Semantics
Statistical machine translation enhancements through linguistic levels: A survey
ACM Computing Surveys (CSUR)
Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model
Computational Linguistics
We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER does, but at a much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widely used n-gram-based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. More accurate, non-automatic adequacy-oriented MT evaluation metrics like HTER, however, are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when untrained monolingual readers annotate semantic roles in MT output, the non-automatic version of the metric, HMEANT, achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the metric, and show that even this semi-automated version achieves a 0.34 correlation coefficient with human adequacy judgments, about 80% as closely correlated as HTER, at an even lower labor cost. These results show that the proposed metric correlates significantly better with human adequacy judgments than current widely used automatic evaluation metrics, while being far more cost-effective than HTER.
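To make the "matching semantic role fillers" idea concrete, the Python sketch below scores a toy MT/reference sentence pair as an F-score over shared role fillers. This is a rough illustration only, not the paper's actual MEANT definition: the names (Frame, filler_similarity, meant_like_score), the Jaccard token-overlap similarity, and the unweighted F-score are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """A shallow semantic frame: a predicate plus a role -> filler-token mapping."""
    predicate: str
    roles: dict  # e.g. {"ARG0": ["the", "police"], "ARG1": ["the", "crowd"]}

def filler_similarity(mt_tokens, ref_tokens):
    """Token-overlap (Jaccard) similarity between two role fillers.
    A stand-in for the metric's actual filler-matching step."""
    mt, ref = set(mt_tokens), set(ref_tokens)
    return len(mt & ref) / len(mt | ref) if mt | ref else 0.0

def meant_like_score(mt_frames, ref_frames):
    """Unweighted F-score over matched semantic role fillers.

    For each reference frame, find the MT frame with the same predicate
    and credit each shared role by its filler similarity. Precision is
    normalized by roles in the MT output, recall by roles in the reference.
    """
    matched = 0.0
    mt_roles = sum(len(f.roles) for f in mt_frames)
    ref_roles = sum(len(f.roles) for f in ref_frames)
    mt_by_pred = {f.predicate: f for f in mt_frames}
    for ref in ref_frames:
        mt = mt_by_pred.get(ref.predicate)
        if mt is None:
            continue  # predicate untranslated or mistranslated: no credit
        for role, ref_filler in ref.roles.items():
            if role in mt.roles:
                matched += filler_similarity(mt.roles[role], ref_filler)
    precision = matched / mt_roles if mt_roles else 0.0
    recall = matched / ref_roles if ref_roles else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example: one frame each in the MT output and the reference.
mt = [Frame("arrest", {"ARG0": ["police"], "ARG1": ["the", "protesters"]})]
ref = [Frame("arrest", {"ARG0": ["the", "police"], "ARG1": ["the", "demonstrators"]})]
print(meant_like_score(mt, ref))  # ~0.42: partial credit from overlapping fillers
```

In the metric described in the abstract, the frames and role fillers come either from untrained monolingual annotators (HMEANT) or from automatic shallow semantic parsing (the semi-automated version); the sketch above only conveys the overall precision/recall structure of matching role fillers between MT output and reference.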