Detecting Fractures in Classifier Performance

  • Authors: David A. Cieslak; Nitesh V. Chawla


  • Venue: ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
  • Year: 2007

Abstract

A fundamental tenet assumed by many classification algorithms is that training and testing samples are drawn from the same distribution of data; this is the stationary distribution assumption, and it entails that the past is strongly indicative of the future. However, in real-world applications, many factors may alter, both significantly and subtly, the One True Model responsible for generating the data. In circumstances violating the stationary distribution assumption, traditional validation schemes such as ten-fold cross-validation and hold-out testing become poor performance predictors and classifier rankers. It therefore becomes critical to discover the fracture points in classifier performance by discovering the divergence between populations. In this paper, we implement a comprehensive evaluation framework to identify bias, enabling selection of a "correct" classifier given the sample bias. To thoroughly evaluate the performance of classifiers within biased distributions, we consider three scenarios: missing completely at random (akin to stationary), missing at random, and missing not at random. The latter reflects the canonical sample selection bias problem.
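
The sketch below is not the authors' framework; it is a minimal illustration of the three bias scenarios named in the abstract. It sub-samples a synthetic training set under MCAR, MAR, and MNAR rules and compares a classifier's held-out accuracy in each case. The dataset, the decision-tree classifier, and the specific retention rules (feature-dependent for MAR, label-dependent for MNAR) are illustrative assumptions.

```python
# Minimal sketch of MCAR / MAR / MNAR training-set bias (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def keep_mask(kind, X, y, keep=0.5):
    """Boolean mask selecting a biased training subsample under one scenario."""
    n = len(y)
    if kind == "MCAR":
        # Missing completely at random: uniform retention probability.
        p = np.full(n, keep)
    elif kind == "MAR":
        # Missing at random: retention depends on an observed feature.
        p = keep * (1 + np.tanh(X[:, 0]))
    else:
        # MNAR: retention depends on the class label itself (sample selection bias).
        p = np.where(y == 1, keep * 1.6, keep * 0.4)
    return rng.random(n) < np.clip(p, 0.0, 1.0)

for kind in ["MCAR", "MAR", "MNAR"]:
    m = keep_mask(kind, X_train, y_train)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[m], y_train[m])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{kind}: trained on {m.sum()} biased samples, test accuracy = {acc:.3f}")
```

Under MCAR the biased sample remains representative, so cross-validation on the training data stays a reasonable predictor of test performance; under MAR and especially MNAR the training and test distributions diverge, which is the setting in which the paper's evaluation framework aims to detect performance fractures.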