Computational Statistics & Data Analysis
The calibration of a probabilistic classifier refers to the extent to which its probability estimates match the true class membership probabilities. Measuring the calibration of a classifier usually relies on chi-squared goodness-of-fit tests that compare grouped probability estimates with the outcomes observed in those groups. We considered alternatives to the Hosmer-Lemeshow test, the standard chi-squared test in which groups are formed from sorted model outputs. Because this grouping does not represent "natural" groupings in data space, we investigated a chi-squared test with grouping strategies in data space. Using a series of artificial data sets for which the correct models are known, and one real-world data set, we analyzed the performance of the Pigeon-Heyse test with groupings by self-organizing maps, k-means clustering, and random assignment of points to groups. We observed that the Pigeon-Heyse test offers slightly better performance than the Hosmer-Lemeshow test while also being able to locate regions of poor calibration in data space.
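As a rough illustration of the grouped chi-squared idea described above, the following minimal sketch computes a Hosmer-Lemeshow-style statistic for a binary classifier. It is not the authors' implementation. The function names (`grouped_chi_squared`, `sorted_output_groups`, `hosmer_lemeshow`) and the default of 10 groups are assumptions made for this example, and the sketch uses the plain binomial variance per group, without the additional correction that the Pigeon-Heyse test applies.

```python
from typing import Dict, List, Sequence, Tuple


def grouped_chi_squared(probs: Sequence[float], labels: Sequence[int],
                        groups: Sequence[int]) -> float:
    """Chi-squared calibration statistic for an arbitrary grouping.

    probs  : predicted probability of the positive class for each point
    labels : observed outcomes (0 or 1)
    groups : group id for each point (hypothetical input; any grouping works)
    """
    # Per group: (count, expected positives, observed positives).
    sums: Dict[int, Tuple[int, float, int]] = {}
    for p, y, g in zip(probs, labels, groups):
        n, expected, observed = sums.get(g, (0, 0.0, 0))
        sums[g] = (n + 1, expected + p, observed + y)

    stat = 0.0
    for n, expected, observed in sums.values():
        mean_p = expected / n
        variance = n * mean_p * (1.0 - mean_p)  # binomial variance of the group
        if variance > 0.0:  # skip degenerate groups with mean 0 or 1
            stat += (observed - expected) ** 2 / variance
    return stat


def sorted_output_groups(probs: Sequence[float], n_groups: int = 10) -> List[int]:
    """Assign points to roughly equal-sized groups by sorted model output."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    groups = [0] * len(probs)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(probs)
    return groups


def hosmer_lemeshow(probs: Sequence[float], labels: Sequence[int],
                    n_groups: int = 10) -> Tuple[float, int]:
    """Return the statistic and the conventional degrees of freedom (g - 2)."""
    groups = sorted_output_groups(probs, n_groups)
    return grouped_chi_squared(probs, labels, groups), n_groups - 2
```

Because `grouped_chi_squared` accepts any group assignment, the same function could in principle be called with group ids produced in data space (for example cluster indices from k-means or random assignments) instead of groups formed from sorted outputs. A p-value would then be obtained by comparing the statistic against a chi-squared distribution with the appropriate degrees of freedom.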