Efficient Data Mining by Active Learning

  • Authors: Hiroshi Mamitsuka; Naoki Abe

  • Venue: Progress in Discovery Science, Final Report of the Japanese Discovery Science Project
  • Year: 2002

Abstract

Data scalability is an important issue in data mining and knowledge discovery. We address it by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method in the general category of 'uncertainty sampling,' obtained by adopting and extending 'query by bagging,' a query learning method proposed earlier by the authors. We empirically evaluate the proposed method by comparing its performance with that of Breiman's Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the proposed method compares favorably with Ivotes, both in the predictive accuracy achieved within a fixed amount of computation time and in the final accuracy achieved. This is especially the case when the data size approaches a million, a typical data size in real-world data mining applications. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels.
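The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind committee-based uncertainty sampling with bagging, not the authors' implementation. It assumes a toy one-dimensional threshold learner and hypothetical names and parameters (`train_stump`, `query_by_bagging`, `n_rounds`, `batch`, `n_bags`): in each round a committee is trained on bootstrap resamples of the data selected so far, and the examples on which the committee's vote is closest to a tie are added to the selected set.

```python
import random


def train_stump(sample):
    """Toy base learner (an assumption): pick the threshold t that minimizes
    training errors for the rule 'predict 1 when x >= t'."""
    candidates = [x for x, _ in sample] + [float("inf")]
    return min(candidates, key=lambda t: sum((x >= t) != bool(y) for x, y in sample))


def bagged_committee(selected, n_bags, rng):
    """Bagging: train one stump per bootstrap resample of the selected data."""
    return [
        train_stump([rng.choice(selected) for _ in selected])
        for _ in range(n_bags)
    ]


def votes_for_one(committee, x):
    """Number of committee members that predict class 1 for x."""
    return sum(x >= t for t in committee)


def query_by_bagging(pool, n_rounds=5, batch=10, n_bags=11, seed=0):
    """Select examples from `pool` (a list of (x, label) pairs) by vote margin.

    Returns the final committee and the selected examples.
    """
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)
    # Start from a small random sample of the pool.
    selected, remaining = remaining[:batch], remaining[batch:]
    for _ in range(n_rounds):
        committee = bagged_committee(selected, n_bags, rng)
        # Smallest vote margin first: these are the most uncertain examples.
        remaining.sort(key=lambda ex: abs(2 * votes_for_one(committee, ex[0]) - n_bags))
        selected, remaining = selected + remaining[:batch], remaining[batch:]
    # Final hypothesis: a committee trained on everything selected.
    return bagged_committee(selected, n_bags, rng), selected


def predict(committee, x):
    """Majority vote of the committee."""
    return int(2 * votes_for_one(committee, x) > len(committee))
```

With a pool such as `[(i / 500, int(i >= 250)) for i in range(500)]`, the selected examples concentrate near the class boundary at 0.5, which is the intended effect of selecting by vote margin. A realistic setting would replace the stump with a stronger base learner and a much larger pool.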