Data scalability is an important issue in data mining and knowledge discovery. We address this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method in the general category of 'uncertainty sampling,' obtained by adopting and extending the 'query by bagging' method proposed earlier by the authors as a query learning method. We empirically evaluate the proposed method by comparing its performance against Breiman's Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the proposed method compares favorably against Ivotes, both in the predictive accuracy achieved with a fixed amount of computation time and in the final accuracy reached. This is especially the case when the data size approaches a million examples, a typical size encountered in real-world data mining applications. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels.
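To illustrate the general idea behind query-by-bagging-style uncertainty sampling, the sketch below selects the pool examples on which a bagged committee of classifiers disagrees most (smallest vote margin). This is only a minimal illustration of the technique's core selection step, not the authors' exact algorithm; the function name `query_by_bagging` and the toy threshold committee (standing in for trees trained on bootstrap resamples) are hypothetical.

```python
import numpy as np

def query_by_bagging(X_pool, committee, batch_size):
    """Pick the pool points the committee disagrees on most
    (a vote-margin flavour of uncertainty sampling)."""
    votes = np.array([clf(X_pool) for clf in committee])  # shape: (members, pool)
    ones = votes.sum(axis=0)                              # votes for class 1
    margin = np.abs(2 * ones - len(committee))            # small margin = high disagreement
    return np.argsort(margin)[:batch_size]                # most-contested points first

# Hypothetical committee: simple threshold rules standing in for
# classifiers trained on bootstrap resamples of the labelled data.
committee = [lambda x, t=t: (x > t).astype(int) for t in (0.3, 0.5, 0.7)]
pool = np.array([0.1, 0.4, 0.6, 0.9])
print(query_by_bagging(pool, committee, batch_size=2))  # → [1 2]
```

The points at 0.4 and 0.6 fall between the committee's decision thresholds, so the members split their votes there; the unanimous points at 0.1 and 0.9 are skipped. In a scaled-up setting, only the selected batch would be labelled and added to the training set before retraining the committee.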