Accurate decision trees for mining high-speed data streams

  • Authors:
  • João Gama; Ricardo Rocha; Pedro Medas

  • Affiliations:
  • Univ. do Porto, R. do Campo Alegre 823, 4150 Porto, Portugal; Projecto Matemática Ensino, 3810 Aveiro, Portugal; Univ. do Porto, R. do Campo Alegre 823, 4150 Porto, Portugal

  • Venue:
  • Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2003

Abstract

In this paper we study the problem of constructing accurate decision tree models from data streams. Learning from data streams is an incremental task that requires incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. We extend the VFDT system in two directions: the ability to deal with continuous data and the use of more powerful classification techniques at tree leaves. The proposed system, VFDTc, can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is its ability to obtain performance similar to that of a standard decision tree algorithm even for medium-sized data sets. This is relevant because of the any-time property: a usable model must be available before the stream has supplied a large number of examples. We study the behaviour of VFDTc on different problems and demonstrate its utility on large and medium-sized data sets. In a bias-variance analysis we observe that VFDTc, in comparison to C4.5, is able to reduce the variance component.
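
The abstract does not describe how VFDT works internally. As background, VFDT-style learners are generally described as using the Hoeffding bound to decide when a leaf has seen enough examples to split. The following is a minimal sketch of that test only, not the authors' implementation: the function names `hoeffding_bound` and `should_split`, the parameter names, and the example values are all hypothetical, and the gain values are assumed to be information gains measured in bits.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean
    of a variable with the given range lies within epsilon of the mean
    observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n_classes: int, delta: float, n: int) -> bool:
    """Illustrative split test (hypothetical helper): split when the gap
    between the two best attributes exceeds the Hoeffding bound."""
    # Assumption: gains are information gains in bits, so their range
    # is [0, log2(n_classes)].
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The same observed gap is not trusted after 50 examples but is
    # trusted after 1000, because the bound shrinks as n grows.
    print(should_split(0.30, 0.10, n_classes=2, delta=1e-6, n=50))
    print(should_split(0.30, 0.10, n_classes=2, delta=1e-6, n=1000))
```

Because the bound depends only on the number of examples seen at a leaf, the test can be evaluated from running counts, which is consistent with the single-scan, constant-time-per-example behaviour the abstract claims.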