In this paper we propose a unified framework for the automatic evaluation of NLP applications using N-gram co-occurrence statistics. The automatic evaluation metrics proposed to date for Machine Translation and Automatic Summarization are particular instances of the family of metrics we propose. We show that different members of this family best explain the variations observed in human evaluations, depending on the application being evaluated (Machine Translation, Automatic Summarization, or Automatic Question Answering) and on the guidelines humans follow when evaluating that application.
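As an illustration, here is a minimal sketch (not the paper's actual implementation) of the n-gram co-occurrence counts underlying this family of metrics: a BLEU-style precision and a ROUGE-style recall are simply two normalizations of the same clipped overlap count. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with multiplicities."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_cooccurrence(candidate, reference, n):
    """Clipped n-gram co-occurrence between a candidate and a reference.

    Returns (precision, recall): precision normalizes the overlap by the
    candidate's n-gram count (BLEU-style); recall normalizes it by the
    reference's n-gram count (ROUGE-style).
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped co-occurrence count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

# Example: 3 of the 5 candidate bigrams also occur in the reference.
p, r = ngram_cooccurrence("the cat sat on the mat",
                          "the cat is on the mat", 2)
```

Different applications would then weight or combine such counts differently, which is the degree of freedom that distinguishes members of the metric family.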