An empirical study of learning from imbalanced data

  • Authors: Xiuzhen Zhang; Yuxuan Li
  • Affiliations: RMIT University, Melbourne, Australia (both authors)
  • Venue: ADC '11: Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
  • Year: 2011


Abstract

No consistent conclusions have been drawn from existing studies regarding the effectiveness of different approaches to learning from imbalanced data. In this paper we apply bias-variance analysis to study the utility of different strategies for imbalanced learning. We conduct experiments on 15 real-world imbalanced datasets, applying various re-sampling and induction-bias adjustment strategies to the standard decision tree, naive Bayes and k-nearest neighbour (k-NN) learning algorithms. Our main findings are: imbalanced class distribution is primarily a high-bias problem, which partly explains why it impedes the performance of many standard learning algorithms; compared to the re-sampling strategies, adjusting induction bias changes the bias and variance components of classification error more substantially; in particular, the inverse distance weighting strategy can significantly reduce the variance errors for k-NN. Based on these findings we offer practical advice on applying the re-sampling and induction-bias adjustment strategies to improve imbalanced learning.
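
The kind of experiment the abstract describes, estimating the bias and variance components of 0-1 loss for k-NN with and without inverse-distance weighting, can be sketched as follows. This is an illustrative assumption, not the paper's experimental code: the synthetic dataset, the roughly 9:1 class ratio, the bootstrap-based decomposition, and the use of scikit-learn's KNeighborsClassifier with weights='distance' as the inverse-distance-weighting variant are all choices made here for demonstration.

```python
# Minimal sketch (assumed setup, not the authors' code): estimate bias and
# variance of 0-1 loss over bootstrap training sets for k-NN with uniform vs
# inverse-distance weighting on a synthetic imbalanced dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Synthetic binary problem with a roughly 9:1 majority:minority class ratio.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

def bias_variance(make_clf, n_rounds=50):
    """Approximate bias/variance of 0-1 loss over bootstrap training sets."""
    preds = np.empty((n_rounds, len(y_test)), dtype=int)
    for r in range(n_rounds):
        idx = rng.choice(len(y_pool), size=len(y_pool), replace=True)
        clf = make_clf().fit(X_pool[idx], y_pool[idx])
        preds[r] = clf.predict(X_test)
    main_pred = np.round(preds.mean(axis=0)).astype(int)  # majority vote across rounds
    bias = np.mean(main_pred != y_test)                   # error of the main prediction
    variance = np.mean(preds != main_pred[None, :])       # disagreement with main prediction
    return bias, variance

for name, w in [("uniform", "uniform"), ("inverse-distance", "distance")]:
    b, v = bias_variance(lambda w=w: KNeighborsClassifier(n_neighbors=5, weights=w))
    print(f"k-NN ({name:16s}) bias={b:.3f} variance={v:.3f}")
```

Under this setup, comparing the two printed variance estimates gives a rough check of the abstract's claim that inverse-distance weighting reduces the variance component for k-NN; the exact decomposition used in the paper may differ.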