Evaluation Evaluation

  • Authors:
  • David M. W. Powers

  • Affiliations:
  • AILab, CSEM, Flinders University of South Australia, email: David.Powers@flinders.edu.au

  • Venue:
  • Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence
  • Year:
  • 2008

Abstract

Over the last decade there has been increasing concern about the biases embodied in traditional evaluation methods for Natural Language Processing/Learning, particularly methods borrowed from Information Retrieval. Without knowledge of the Bias and Prevalence of the contingency being tested, or equivalently the expectation due to chance, the simple conditional probabilities Recall, Precision and Accuracy are not meaningful as evaluation measures, either individually or in combinations such as F-factor. The existence of bias in NLP measures leads to the 'improvement' of systems by increasing their bias, such as the practice of improving tagging and parsing scores by using the most common value (e.g. water is always a Noun) rather than attempting to discover the correct one. In this paper, we analyze both biased and unbiased measures theoretically, characterizing the precise relationship between all these measures.
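
To make the quantities named in the abstract concrete, the sketch below computes Recall, Precision, Accuracy, F-factor (as F1), Bias, Prevalence and the accuracy expected by chance from a 2x2 contingency table. It uses the standard textbook definitions of these measures, not the paper's own derivation; the function and variable names are illustrative assumptions.

```python
# Minimal sketch of the standard contingency-table measures discussed in the
# abstract. Definitions follow common textbook usage; names are illustrative.

def contingency_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the biased measures and the chance expectation for a 2x2 table."""
    n = tp + fp + fn + tn
    prevalence = (tp + fn) / n          # proportion of real positives
    bias = (tp + fp) / n                # proportion of predicted positives
    recall = tp / (tp + fn)             # a.k.a. sensitivity, true positive rate
    precision = tp / (tp + fp)          # a.k.a. positive predictive value
    accuracy = (tp + tn) / n
    f1 = 2 * precision * recall / (precision + recall)
    # Accuracy expected by chance if predictions were made independently of
    # the truth with the same Bias -- the baseline a biased system can exploit.
    chance_accuracy = bias * prevalence + (1 - bias) * (1 - prevalence)
    return {
        "prevalence": prevalence,
        "bias": bias,
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "chance_accuracy": chance_accuracy,
    }


if __name__ == "__main__":
    # Example: a tagger that almost always predicts the majority class can
    # score high Accuracy and F1 simply by pushing its Bias toward Prevalence.
    print(contingency_measures(tp=90, fp=8, fn=1, tn=1))
```

In the example call, most of the apparent Accuracy is already accounted for by the chance expectation, which is the point the abstract makes about these measures being uninterpretable without Bias and Prevalence.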