Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II
Precision and recall of machine translation
NAACL-Short '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short papers - Volume 2
Filtering-Ranking Perceptron Learning for Partial Parsing
Machine Learning
ACL '04 Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics
Coarse-to-fine n-best parsing and MaxEnt discriminative reranking
ACL '05 Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
QARLA: a framework for the evaluation of text summarization systems
ACL '05 Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
ORANGE: a method for evaluating automatic evaluation metrics for machine translation
COLING '04 Proceedings of the 20th International Conference on Computational Linguistics
A robust combination strategy for semantic role labeling
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
MT evaluation: human-like vs. human acceptable
COLING-ACL '06 Proceedings of the COLING/ACL Main Conference Poster Sessions
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the Second International Conference on Human Language Technology Research
Manual and automatic evaluation of machine translation between European languages
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
Evaluating machine translation with LFG dependencies
Machine Translation
A re-examination on features in regression-based approach to automatic MT evaluation
HLT-SRWS '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Student Research Workshop
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Decomposability of translation metrics for improved evaluation and efficient algorithms
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Semantic roles for SMT: a hybrid two-pass model
NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
(Meta-) evaluation of machine translation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
A smorgasbord of features for automatic MT evaluation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Findings of the 2009 workshop on statistical machine translation
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
On the robustness of syntactic and semantic features for automatic MT evaluation
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
The contribution of linguistic features to automatic machine translation evaluation
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
Cross-lingual annotation projection of semantic roles
Journal of Artificial Intelligence Research
ATEC: automatic evaluation of machine translation via word choice and word order
Machine Translation
MaxSim: performance and effects of translation fluency
Machine Translation
Metrics for MT evaluation: evaluating reordering
Machine Translation
Tackling sparse data issue in machine translation evaluation
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Document-level automatic MT evaluation based on discourse representations
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
All in strings: a powerful string-based automatic MT evaluation metric with multiple granularities
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Improvement of machine translation evaluation by simple linguistically motivated features
Journal of Computer Science and Technology - Special issue on natural language processing
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Structured vs. flat semantic role representations for machine translation evaluation
SSST-5 Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
Approximating a deep-syntactic metric for MT evaluation and tuning
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
TINE: a metric to assess MT adequacy
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Corroborating text evaluation results with heterogeneous measures
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Journal of the American Society for Information Science and Technology
Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Semantic textual similarity for MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
DCU-Symantec submission for the WMT 2012 quality estimation task
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Match without a referee: evaluating MT adequacy without reference translations
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Fully automatic semantic MT evaluation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Statistical machine translation enhancements through linguistic levels: A survey
ACM Computing Surveys (CSUR)
Evaluation results recently reported by Callison-Burch et al. (2006) and Koehn and Monz (2006) revealed that, in certain cases, the BLEU metric may not be a reliable indicator of MT quality. This happens, for instance, when the systems under evaluation are based on different paradigms and therefore do not share the same lexicon. The reason is that, while MT quality has many dimensions, BLEU limits its scope to the lexical one. In this work, we suggest using metrics that take into account linguistic features at more abstract levels. We provide experimental results showing that metrics based on deeper linguistic information (syntactic/shallow-semantic) are able to produce more reliable system rankings than metrics based on lexical matching alone, especially when the systems under evaluation are of a different nature.
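The lexical limitation described above can be illustrated with a minimal sketch of clipped n-gram precision, the core matching step of BLEU (the function name and example sentences below are illustrative, not taken from the paper): a candidate that paraphrases the reference with a different lexicon scores near zero despite being adequate.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams that
    also occur in the reference (each reference n-gram can be matched
    at most as many times as it appears there)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return matched / total if total else 0.0

reference = "the cat sat on the mat"
same_lexicon = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # adequate, but shares only "the"

print(ngram_precision(same_lexicon, reference, 1))  # 1.0
print(ngram_precision(paraphrase, reference, 1))    # 1/6: only "the" matches
```

A purely lexical metric ranks the paraphrase far below the verbatim candidate, which is the failure mode the syntactic and shallow-semantic metrics studied here are meant to mitigate.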