Effects of data grouping on calibration measures of classifiers

  • Authors:
  • Stephan Dreiseitl; Melanie Osl

  • Affiliations:
  • Dept. of Software Engineering, Upper Austria University of Applied Sciences, Hagenberg, Austria; Division of Biomedical Informatics, University of California, San Diego, La Jolla, California

  • Venue:
  • EUROCAST'11: Proceedings of the 13th International Conference on Computer Aided Systems Theory - Volume Part I
  • Year:
  • 2011


Abstract

The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on performing chi-squared goodness-of-fit tests between grouped probabilities and the observations in these groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test in which groups are formed from sorted model outputs. Since this grouping does not represent "natural" groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, by k-means clustering, and by random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while being able to locate regions of poor calibration in data space.
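A minimal sketch (not the authors' code) may help contrast the two grouping strategies the abstract describes: the Hosmer-Lemeshow approach, which groups points by sorted model outputs, versus grouping directly in data space, here illustrated with k-means. The function and variable names (`grouped_chi2`, `y_true`, `y_prob`, `X`) are hypothetical, and the statistic shown is the plain Hosmer-Lemeshow chi-squared form; the Pigeon-Heyse test adds a per-group variance correction that is omitted here for brevity.

```python
# Sketch only: contrasts output-based grouping (Hosmer-Lemeshow style)
# with data-space grouping (k-means), under assumed variable names.
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans

def grouped_chi2(y_true, y_prob, groups):
    """Chi-squared calibration statistic over an arbitrary grouping.

    For each group, observed positives are compared with expected
    positives (the sum of predicted probabilities). This is the
    Hosmer-Lemeshow form; the Pigeon-Heyse test additionally applies
    a variance correction per group, omitted here.
    """
    stat = 0.0
    for g in np.unique(groups):
        idx = groups == g
        o1 = y_true[idx].sum()        # observed positives in group
        e1 = y_prob[idx].sum()        # expected positives in group
        n = idx.sum()
        o0, e0 = n - o1, n - e1       # observed/expected negatives
        stat += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return stat

def output_space_groups(y_prob, n_groups=10):
    """Hosmer-Lemeshow style: groups are deciles of sorted outputs."""
    ranks = np.argsort(np.argsort(y_prob))
    return ranks * n_groups // len(y_prob)

def data_space_groups(X, n_groups=10, seed=0):
    """Data-space alternative: groups are k-means clusters of X."""
    km = KMeans(n_clusters=n_groups, random_state=seed, n_init=10)
    return km.fit_predict(X)

# Usage (assumed inputs): y_true holds 0/1 labels, y_prob the
# classifier's probability estimates, X the feature matrix.
# groups = output_space_groups(y_prob)   # or: data_space_groups(X)
# stat = grouped_chi2(y_true, y_prob, groups)
# p = chi2.sf(stat, df=len(np.unique(groups)) - 2)
```

With a data-space grouping, per-group contributions to the statistic can be traced back to clusters of points, which is what allows poorly calibrated regions to be located in data space rather than only along the sorted-output axis.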