GP classification under imbalanced data sets: active sub-sampling and AUC approximation

  • Authors:
  • John Doucette; Malcolm I. Heywood

  • Affiliations:
  • Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada; Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada

  • Venue:
  • EuroGP'08 Proceedings of the 11th European conference on Genetic programming
  • Year:
  • 2008

Abstract

The problem of evolving binary classification models under increasingly imbalanced data sets is approached through a strategy with two components: sub-sampling and 'robust' fitness function design. In particular, recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often inappropriate when the goal is a classifier that avoids degenerate behavior. To this end we propose a 'Simple Active Learning Heuristic' (SALH), in which the subset of exemplars used for fitness evaluation is sampled with uniform probability under a rule that enforces class balance. In addition, an efficient estimator for the Area Under the Curve (AUC) performance metric is adopted in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets, comparing canonical GP, SALH-based GP, GP with SALH and the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions that maximize performance assessed in terms of AUC.
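
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not spell out the SALH balance rule or how the WMW statistic is modified, so the code below assumes an equal number of exemplars per class (drawn with replacement when a class is too small) and uses the standard, unmodified WMW statistic, which equals the empirical AUC. The function names `balanced_subsample` and `wmw_auc`, and the 0/1 label encoding, are hypothetical choices made for illustration.

```python
import random


def balanced_subsample(labels, subset_size, rng):
    """Return indices of a class-balanced subset (assumed rule: half per class).

    Exemplars are drawn uniformly within each class. If a class holds fewer
    exemplars than requested, it is sampled with replacement (an assumption,
    not something stated in the abstract).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    per_class = subset_size // 2

    def draw(pool):
        if len(pool) >= per_class:
            return rng.sample(pool, per_class)  # without replacement
        return [rng.choice(pool) for _ in range(per_class)]  # with replacement

    return draw(pos) + draw(neg)


def wmw_auc(scores, labels):
    """Standard Wilcoxon-Mann-Whitney estimate of AUC.

    Counts the fraction of (positive, negative) pairs in which the positive
    exemplar receives the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 5 + [0] * 95                  # 5% minority class
    subset = balanced_subsample(labels, 20, rng)
    scores = [float(labels[i]) for i in subset]  # stand-in for GP outputs
    print(wmw_auc(scores, [labels[i] for i in subset]))
```

In a GP setting, a fitness function along these lines would score each individual's outputs on the sub-sampled exemplars with `wmw_auc`, so that a classifier labelling everything as the majority class gains no advantage from the skewed class distribution.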