Data scalability is an important issue in data mining and knowledge discovery. We address this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method in the general category of 'uncertainty sampling,' obtained by adopting and extending the 'query by bagging' method proposed earlier by the authors as a query learning method. We empirically evaluate the proposed method by comparing its performance against Breiman's Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the proposed method compares favorably against Ivotes, both in the predictive accuracy achieved with a fixed amount of computation time and in the final accuracy reached. This is especially the case when the data size approaches a million examples, a typical size encountered in real-world data mining applications. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels.
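To illustrate the general idea behind query-by-bagging-style uncertainty sampling, the sketch below selects the pool examples on which a bagged committee of classifiers disagrees most (smallest vote margin). This is only a minimal illustration of the technique's core selection step, not the authors' exact algorithm; the function name `query_by_bagging` and the toy threshold committee (standing in for trees trained on bootstrap resamples) are hypothetical.

```python
import numpy as np

def query_by_bagging(X_pool, committee, batch_size):
    """Pick the pool points the committee disagrees on most
    (a vote-margin flavour of uncertainty sampling)."""
    votes = np.array([clf(X_pool) for clf in committee])  # shape: (members, pool)
    ones = votes.sum(axis=0)                              # votes for class 1
    margin = np.abs(2 * ones - len(committee))            # small margin = high disagreement
    return np.argsort(margin)[:batch_size]                # most-contested points first

# Hypothetical committee: simple threshold rules standing in for
# classifiers trained on bootstrap resamples of the labelled data.
committee = [lambda x, t=t: (x > t).astype(int) for t in (0.3, 0.5, 0.7)]
pool = np.array([0.1, 0.4, 0.6, 0.9])
print(query_by_bagging(pool, committee, batch_size=2))  # → [1 2]
```

The points at 0.4 and 0.6 fall between the committee's decision thresholds, so the members split their votes there; the unanimous points at 0.1 and 0.9 are skipped. In a scaled-up setting, only the selected batch would be labelled and added to the training set before retraining the committee.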