We introduce a novel scheme for the human evaluation of machine translation systems. Our method is based on the direct comparison of two sentences at a time by human judges; these binary judgments are then used to decide between all possible rankings of the systems. The advantages of this method are a lower dependency on extensive evaluation guidelines and a tighter focus on a typical evaluation task, namely the ranking of systems. Furthermore, we argue that machine translation evaluation should be regarded as a statistical process, for both human and automatic evaluation. We show how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without resorting to Monte Carlo estimates. We conclude with an example of the new evaluation scheme, as well as a comparison with classical automatic and human evaluation, on data from a recent international evaluation campaign.
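The abstract does not spell out how the non-Monte-Carlo confidence ranges or the aggregation of binary judgments are computed. The sketch below shows one standard analytic route, assuming corpus WER is defined as the ratio of summed sentence-level edit distances to summed reference lengths: a first-order delta-method expansion of that ratio gives a closed-form variance, so no bootstrap resampling is needed. The exact sign test at the end is likewise one illustrative way to turn paired binary sentence judgments into a significance statement. All function names are hypothetical; this is not necessarily the paper's actual procedure.

```python
import math

def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences (word-level WER)."""
    n = len(ref)
    prev = list(range(n + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if h == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n]

def wer_confidence_interval(hyps, refs, z=1.96):
    """Corpus WER with an approximate analytic confidence interval.

    Corpus WER is a ratio of sums, sum_i e_i / sum_i r_i, with e_i the
    sentence edit distance and r_i the reference length. A first-order
    (delta-method) expansion of this ratio yields a closed-form variance
    estimate; needs at least two sentences.
    """
    e = [edit_distance(h.split(), r.split()) for h, r in zip(hyps, refs)]
    r = [len(x.split()) for x in refs]
    n = len(e)
    wer = sum(e) / sum(r)
    mean_r = sum(r) / n
    # Per-sentence influence values of the ratio estimator.
    u = [(e_i - wer * r_i) / mean_r for e_i, r_i in zip(e, r)]
    var = sum(x * x for x in u) / (n * (n - 1))
    half = z * math.sqrt(var)
    return wer, wer - half, wer + half

def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test on paired binary judgments (ties dropped):
    the probability of a split at least this lopsided under a fair coin."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy usage with hypothetical data:
hyps = ["the cat sat on the mat", "a quick brown fox"]
refs = ["the cat sat on a mat", "the quick brown fox"]
wer, lo, hi = wer_confidence_interval(hyps, refs)
print(f"WER = {wer:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print(f"sign test p = {sign_test_p(wins_a=70, wins_b=40):.4f}")
```

The same ratio-of-sums argument applies to TER, since it too normalizes summed edit operations by summed reference lengths; only the per-sentence edit statistic changes.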