Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement

  • Authors:
  • George Forman; Martin Scholz

  • Affiliations:
  • Hewlett-Packard Labs, Palo Alto, CA (both authors)

  • Venue:
  • ACM SIGKDD Explorations Newsletter
  • Year:
  • 2010

Abstract

Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure, and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages, leading to inconsistent results across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in aggregated results over many folds and datasets, and no one ever inspects the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance for how best to measure classification performance under cross-validation. In particular, there are several divergent methods in use for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and for one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods lead to biased measurements, especially under high class imbalance. This paper is of particular interest to designers of machine learning software libraries and to researchers focused on high class imbalance.
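
To make the aggregation issue concrete, the sketch below (not code from the paper) contrasts two commonly used ways of reporting F-measure under k-fold cross-validation: averaging the per-fold F-scores versus pooling all test-fold predictions and computing one F-score over them. The scikit-learn calls, the synthetic imbalanced dataset, and the logistic-regression classifier are illustrative assumptions; under heavy imbalance, folds with few or no positives cause the two aggregations to diverge.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import f1_score

    # Toy, heavily imbalanced binary dataset (illustrative assumption).
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.97, 0.03], random_state=0)

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    per_fold_f1 = []                    # method A: F1 per fold, then averaged
    pooled_true, pooled_pred = [], []   # method B: pool all test predictions

    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        # F1 is undefined on a fold with no true or predicted positives;
        # zero_division=0 silently maps that anomaly to 0.
        per_fold_f1.append(f1_score(y[test_idx], pred, zero_division=0))
        pooled_true.extend(y[test_idx])
        pooled_pred.extend(pred)

    print("A) mean of per-fold F1 :", np.mean(per_fold_f1))
    print("B) F1 over pooled folds:", f1_score(pooled_true, pooled_pred))

The two printed numbers can differ noticeably on imbalanced data, which is exactly the kind of discrepancy the note examines; the sketch only illustrates that the choice of aggregation is not innocuous, not which method the paper ultimately recommends.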