An Evaluation of Progressive Sampling for Imbalanced Data Sets

Authors:
Willie Ng;Manoranjan Dash
Affiliations:
Nanyang Technological University, Singapore;Nanyang Technological University, Singapore
Venue:
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Year:
2006

Abstract

One of the emerging challenges for the data mining research community is to allow learning algorithms to mine huge databases. Sampling has often been suggested as an effective way to circumvent memory limitations and to improve processing speed. In this paper, we study the learning-curve sampling method, an approach for applying machine learning algorithms to massive data sets. We show that a naive application of progressive sampling to data sets with highly imbalanced class distributions is often not very effective for training a learning algorithm. We then present a refinement of progressive sampling that works well in practice and converges to the desired sample size quickly and accurately. Empirical results on a number of large data sets show that the refinement improves the performance of progressive sampling.
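
As an illustration of the general idea only, the following is a minimal sketch of plain progressive sampling. It assumes a geometric schedule of sample sizes and a simple stopping rule based on the score gain between consecutive sizes. The names geometric_schedule, progressive_sample, evaluate and tol are hypothetical and do not come from the paper, and the sketch does not attempt the refinement the paper proposes for imbalanced data.

```python
from typing import Callable, List


def geometric_schedule(n0: int, factor: int, n_max: int) -> List[int]:
    """Sample sizes n0, n0*factor, n0*factor**2, ..., ending at n_max."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n *= factor
    sizes.append(n_max)  # always finish with the full data set
    return sizes


def progressive_sample(sizes: List[int],
                       evaluate: Callable[[int], float],
                       tol: float = 0.01) -> int:
    """Return the sample size at which the learning curve flattens.

    evaluate(n) is a hypothetical callback (an assumption of this sketch):
    it trains on a sample of n rows and returns a score, for example
    accuracy on held-out data.
    """
    prev_score = evaluate(sizes[0])
    for prev_n, n in zip(sizes, sizes[1:]):
        score = evaluate(n)
        if score - prev_score < tol:  # gain too small: treat as converged
            return prev_n
        prev_score = score
    return sizes[-1]  # curve never flattened; use all the data


# Toy learning curve standing in for real training runs.
toy_scores = {100: 0.70, 200: 0.80, 400: 0.85, 800: 0.853, 1000: 0.854}
sizes = geometric_schedule(100, 2, 1000)
chosen = progressive_sample(sizes, lambda n: toy_scores[n], tol=0.01)
```

In this toy run the gain from 400 to 800 rows falls below tol, so the sketch settles on 400 rows. In practice evaluate would draw a random sample of the given size, fit the learner and score it, and both the schedule and the convergence test could be chosen differently.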