Obtaining low-arity discretizations from online data streams

  • Authors:
  • Tapio Elomaa;Petri Lehtinen;Matti Saarela

  • Affiliations:
  • Department of Software Systems, Tampere University of Technology, Tampere, Finland;Department of Software Systems, Tampere University of Technology, Tampere, Finland;Department of Software Systems, Tampere University of Technology, Tampere, Finland

  • Venue:
  • ISMIS'08 Proceedings of the 17th international conference on Foundations of intelligent systems
  • Year:
  • 2008

Abstract

Cut point analysis for the discretization of numerical attributes has shown, for many commonly used attribute evaluation functions, that adjacent value range intervals with an equal relative class distribution may be merged without risk of missing the optimal partition of the range. A natural idea is to relax this requirement and rely on a statistical test to decide whether the intervals are probably generated from the same distribution. ChiMerge is a classical algorithm for numerical interval processing that operates in just this manner. ChiMerge performs the interval merges in the order of their statistical probability. However, in online processing of the data the required O(n log n) time is too much. In this paper we propose to do the merges during a left-to-right scan of the intervals. Thus, we reduce the time requirement of merging to a more reasonable linear time. Moreover, such a linear-time operation need not be carried out for every example. Our empirical evaluation shows that intervals get combined effectively, that the number of intervals grows only very moderately even when the number of examples becomes very large, and that the substantial reduction in the number of intervals can even benefit prediction accuracy.
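To make the idea of merging during a single left-to-right scan concrete, the following is a minimal sketch and not the authors' exact procedure. It assumes each interval is represented by a list of per-class example counts, uses the Pearson chi-squared statistic on the 2 x k table of two neighbouring intervals as the statistical test, and merges an interval into its left neighbour whenever the statistic stays below a caller-supplied threshold. The function names `chi2_statistic` and `merge_left_to_right` and the count-list representation are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def chi2_statistic(a: Sequence[int], b: Sequence[int]) -> float:
    """Pearson chi-squared statistic for the 2 x k table formed by the
    class counts of two adjacent intervals (assumed representation)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    stat = 0.0
    for row in (a, b):
        row_sum = sum(row)
        for j, observed in enumerate(row):
            col_sum = a[j] + b[j]
            expected = row_sum * col_sum / total
            # Skip cells with zero expected count (empty row or class).
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_left_to_right(intervals: Sequence[Sequence[int]],
                        threshold: float) -> List[List[int]]:
    """Scan the intervals once, left to right. Each interval is either
    merged into the most recent result interval (statistic below the
    threshold) or starts a new result interval."""
    merged: List[List[int]] = []
    for counts in intervals:
        if merged and chi2_statistic(merged[-1], counts) < threshold:
            merged[-1] = [x + y for x, y in zip(merged[-1], counts)]
        else:
            merged.append(list(counts))
    return merged


# Two classes give one degree of freedom; 3.841 is the chi-squared
# critical value at the 0.05 significance level for that case.
intervals = [[5, 5], [4, 6], [0, 10], [1, 9]]
result = merge_left_to_right(intervals, threshold=3.841)
print(result)
```

Each interval is visited once and each test costs time proportional to the number of classes, so the scan is linear in the number of intervals. The trade-off against ChiMerge is that merges are decided in scan order rather than in order of statistical similarity, so the resulting partition may differ from the one ChiMerge would produce.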