Effects of data grouping on calibration measures of classifiers

Authors:
Stephan Dreiseitl; Melanie Osl
Affiliations:
Dept. of Software Engineering, Upper Austria University of Applied Sciences, Hagenberg, Austria; Division of Biomedical Informatics, University of California, San Diego, La Jolla, California
Venue:
EUROCAST'11 Proceedings of the 13th international conference on Computer Aided Systems Theory - Volume Part I
Year:
2011

Abstract

The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on performing chi-squared goodness-of-fit tests between grouped probabilities and the observations in these groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test with groups based on sorted model outputs. Since this grouping does not represent "natural" groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, k-means clustering, and random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while being able to locate regions of poor calibration in data space.
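
As an illustration of the grouping by sorted model outputs described above, the following is a minimal sketch of the Hosmer-Lemeshow statistic for a binary classifier. It is not the authors' implementation: the function name `hosmer_lemeshow_statistic`, the default of ten near-equal-sized groups, and the synthetic data are assumptions made for this example, and the Pigeon-Heyse variant with data-space groupings is not shown.

```python
import random


def hosmer_lemeshow_statistic(probs, labels, n_groups=10):
    """Chi-squared statistic over groups of cases sorted by predicted probability.

    Hypothetical helper for illustration; labels are 0/1, probs are P(class 1).
    """
    # Sort cases by model output, so groups live in output space, not data space.
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        # Consecutive slices of the sorted cases give near-equal-sized groups.
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        expected_pos = sum(p for p, _ in group)
        observed_pos = sum(y for _, y in group)
        expected_neg = len(group) - expected_pos
        observed_neg = len(group) - observed_pos
        # Skip terms whose expected count is zero to avoid division by zero.
        if expected_pos > 0:
            stat += (observed_pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            stat += (observed_neg - expected_neg) ** 2 / expected_neg
    return stat


# Synthetic data (an assumption for this sketch): labels drawn from known probabilities.
rng = random.Random(0)
true_p = [rng.random() for _ in range(2000)]
labels = [1 if rng.random() < p else 0 for p in true_p]

calibrated = hosmer_lemeshow_statistic(true_p, labels)
distorted = hosmer_lemeshow_statistic([p ** 2 for p in true_p], labels)
print(calibrated, distorted)
```

A large value of the statistic indicates a mismatch between predicted and observed group frequencies; here the deliberately distorted probabilities should score far worse than the correct ones. In practice the statistic is compared against a chi-squared reference distribution to obtain a p-value, a step omitted from this sketch.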