An Evaluation of Progressive Sampling for Imbalanced Data Sets

  • Authors:
  • Willie Ng; Manoranjan Dash

  • Affiliations:
  • Nanyang Technological University, Singapore; Nanyang Technological University, Singapore

  • Venue:
  • ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
  • Year:
  • 2006

Abstract

One of the emerging challenges for the data mining research community is to enable learning algorithms to mine huge databases. Sampling has often been suggested as an effective way to circumvent memory limitations and to improve processing speed. In this paper, we study learning-curve sampling (progressive sampling), an approach for applying machine learning algorithms to massive data sets. We show that a naive application of progressive sampling to data sets with highly imbalanced class distributions is often not very effective for training a learning algorithm. We then present a refinement of progressive sampling that works well in practice and converges to the desired sample size quickly and accurately. Empirical results on a number of large data sets show that the refinement improves performance.
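
For readers unfamiliar with the technique, the following is a minimal sketch of naive progressive sampling, not the refinement proposed in the paper (the abstract does not describe it). It assumes a geometric sampling schedule and a simple stopping rule based on the gain between consecutive sample sizes; the names `geometric_schedule`, `progressive_sample`, `train_and_score`, and the parameters `n0`, `factor`, and `tol` are all hypothetical and chosen only for illustration.

```python
import random


def geometric_schedule(n_total, n0=100, factor=2):
    """Return sample sizes n0, n0*factor, ... capped at n_total.

    The geometric schedule is an assumption of this sketch.
    """
    sizes = []
    n = n0
    while n < n_total:
        sizes.append(n)
        n *= factor
    sizes.append(n_total)  # always finish with the full data set
    return sizes


def progressive_sample(data, train_and_score, n0=100, factor=2,
                       tol=0.001, seed=0):
    """Grow the training sample until the score stops improving.

    train_and_score is a caller-supplied (hypothetical) function that
    trains a model on the given sample and returns a score where
    higher is better. The loop stops at the first size whose gain
    over the previous size is below tol.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # so that each prefix is a random sample

    history = []
    previous = None
    for n in geometric_schedule(len(shuffled), n0, factor):
        score = train_and_score(shuffled[:n])
        history.append((n, score))
        if previous is not None and score - previous < tol:
            break  # learning curve has flattened (by this rule)
        previous = score
    return history[-1][0], history


if __name__ == "__main__":
    # Toy stand-in for a learner: the score depends only on sample size.
    toy_score = lambda sample: 1.0 - 1.0 / len(sample)
    size, curve = progressive_sample(list(range(10000)), toy_score)
    print(size, curve)
```

With a highly imbalanced class distribution, the small early samples drawn this way may contain few or no minority-class examples, which is one plausible reason the learning curve measured on them can be misleading.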