Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR, and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance-test-driven comparison of n-gram-based automatic MT evaluation metrics. We use statistical significance tests based on bootstrap resampling to estimate the reliability of automatic MT evaluation scores. Based on these reliability estimates, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
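To make the bootstrapping step concrete, the sketch below illustrates one common way such reliability estimates are computed: resampling test-set sentences with replacement to build a confidence interval for a corpus-level score, plus a paired bootstrap test for comparing two systems on the same test set. This is a minimal illustration under our own assumptions, not the paper's exact procedure; the function names (bootstrap_ci, paired_bootstrap_test) are hypothetical, and the sketch assumes a metric that aggregates as a mean of per-sentence scores. Corpus-level BLEU would instead recompute its n-gram statistics over each resample.

```python
import random

def bootstrap_ci(sentence_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for a corpus-level score
    by resampling test-set sentences with replacement (bootstrap).
    Assumes the corpus score is the mean of per-sentence scores."""
    rng = random.Random(seed)
    n = len(sentence_scores)
    corpus_scores = []
    for _ in range(n_resamples):
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        corpus_scores.append(sum(sample) / n)
    corpus_scores.sort()
    lo = corpus_scores[int((alpha / 2) * n_resamples)]
    hi = corpus_scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def paired_bootstrap_test(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of bootstrap resamples in which system A
    outscores system B. Both lists hold per-sentence scores for the
    same test sentences (paired resampling)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

As a usage example, if paired_bootstrap_test returns 0.99 over 1,000 resamples, system A outscores system B in 99% of resampled test sets, so the observed score difference is unlikely to be an artifact of the particular test set; values near 0.5 indicate the difference is not reliable at that test-set size.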