Hybrid sampling for imbalanced data
Integrated Computer-Aided Engineering - Selected papers from the IEEE Conference on Information Reuse and Integration (IRI), July 13-15, 2008
Discovery of frequent patterns in transactional data streams
Transactions on large-scale data- and knowledge-centered systems II
An empirical evaluation of bagging with different algorithms on imbalanced data
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I
Empirical study of bagging predictors on medical data
AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
One of the emerging challenges for the data mining research community is enabling learning algorithms to mine huge databases. Sampling has often been suggested as an effective way to circumvent memory limitations and to improve processing speed. In this paper, we study the learning-curve sampling method, an approach for applying machine learning algorithms to massive data sets. We show that a naive application of progressive sampling to data sets with highly imbalanced class distributions is often not very effective for training a learning algorithm. We then present a refinement of progressive sampling that works well in practice and converges to the desired sample size quickly and accurately. Empirical results on a number of large data sets show that the refinement improves performance.
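To make the idea of progressive (learning-curve) sampling concrete, the following is a minimal Python sketch, not the refinement proposed in the paper, whose details the abstract does not give. It assumes a geometric schedule of sample sizes, a simple stopping rule (stop when the score gain between consecutive sizes falls below a tolerance), and class-stratified draws as one plausible way to keep minority examples in small samples. The names `geometric_schedule`, `stratified_indices`, `progressive_sample`, and the caller-supplied `score_fn` are all hypothetical.

```python
import random
from collections import defaultdict


def geometric_schedule(n0, factor, n_max):
    """Sample sizes n0, n0*factor, ... with the full data size n_max last."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        # Always grow by at least one so the loop terminates.
        n = max(n + 1, int(n * factor))
    sizes.append(n_max)
    return sizes


def stratified_indices(labels, n, rng):
    """Draw roughly n indices, preserving class proportions.

    Assumption (not from the paper): every class contributes at least one
    example, so a rare class cannot vanish from a small sample.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    total = len(labels)
    chosen = []
    for idx in by_class.values():
        k = max(1, round(n * len(idx) / total))
        chosen.extend(rng.sample(idx, min(k, len(idx))))
    return chosen


def progressive_sample(labels, score_fn, n0=100, factor=2, tol=0.01, seed=0):
    """Grow the sample until the learning curve flattens.

    score_fn(indices) is supplied by the caller: it trains a model on the
    given rows and returns a validation score (higher is better).
    Returns the learning curve as a list of (requested_size, score) pairs.
    """
    rng = random.Random(seed)
    curve = []
    prev = None
    for n in geometric_schedule(n0, factor, len(labels)):
        score = score_fn(stratified_indices(labels, n, rng))
        curve.append((n, score))
        # Simplistic convergence test: stop once the gain drops below tol.
        if prev is not None and score - prev < tol:
            break
        prev = score
    return curve


if __name__ == "__main__":
    # Toy data: 0.5% positives, and a stand-in score that saturates with size.
    labels = [1] * 50 + [0] * 9950
    for size, score in progressive_sample(labels, lambda idx: 1 - len(idx) ** -0.5):
        print(size, round(score, 4))
```

With purely random draws, a 100-row sample from data with 0.5% positives would often contain no positive example at all, which illustrates why naive progressive sampling can struggle on imbalanced data. For imbalanced classes, a score such as accuracy can also hide poor minority-class performance, so `score_fn` would typically return a class-sensitive measure instead.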