Obtaining low-arity discretizations from online data streams

Authors:
Tapio Elomaa;Petri Lehtinen;Matti Saarela
Affiliations:
Department of Software Systems, Tampere University of Technology, Tampere, Finland;Department of Software Systems, Tampere University of Technology, Tampere, Finland;Department of Software Systems, Tampere University of Technology, Tampere, Finland
Venue:
ISMIS'08 Proceedings of the 17th international conference on Foundations of intelligent systems
Year:
2008

Abstract

Cut point analysis for the discretization of numerical attributes has shown that, for many commonly used attribute evaluation functions, adjacent value range intervals with an equal relative class distribution may be merged without risk of losing the optimal partition of the range. A natural idea is to relax this requirement and rely on a statistical test to decide whether two intervals were probably generated from the same distribution. ChiMerge is a classical algorithm for numerical interval processing that operates in exactly this manner. ChiMerge performs the interval merges in the order of their statistical probability. In online processing of the data, however, the O(n log n) time this requires is too much. In this paper we propose to do the merges during a left-to-right scan of the intervals, which reduces the time requirement of merging to linear. Moreover, these linear-time operations need not be performed for every incoming example. Our empirical evaluation shows that intervals are combined effectively, that their number grows very moderately even when the number of examples becomes very large, and that the substantial reduction in the number of intervals can even benefit prediction accuracy.
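
The left-to-right merging idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the exact procedure. The sketch assumes that each interval is represented by a vector of per-class example counts, that the statistical test is the Pearson chi-squared test used by ChiMerge, and that a merged interval is compared with the next interval to its right. The function names `chi2_statistic` and `merge_left_to_right` are hypothetical, as is the default threshold 3.841 (the critical value for one degree of freedom at the 0.05 level, which fits a two-class problem).

```python
from typing import List, Sequence


def chi2_statistic(a: Sequence[int], b: Sequence[int]) -> float:
    """Pearson chi-squared statistic of the 2 x k table built from two
    class-count vectors (one row per interval, one column per class)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    stat = 0.0
    for row in (a, b):
        row_sum = sum(row)
        for j, observed in enumerate(row):
            col_sum = a[j] + b[j]
            expected = row_sum * col_sum / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_left_to_right(
    intervals: Sequence[Sequence[int]],
    threshold: float = 3.841,  # assumed: df = 1, significance level 0.05
) -> List[List[int]]:
    """Single linear scan over intervals sorted by attribute value.

    The current interval is merged into its left neighbour whenever the
    test cannot distinguish their class distributions, that is, when the
    statistic stays below the threshold.
    """
    merged: List[List[int]] = []
    for counts in intervals:
        counts = list(counts)
        if merged and chi2_statistic(merged[-1], counts) < threshold:
            # Distributions look alike: pool the class counts.
            merged[-1] = [x + y for x, y in zip(merged[-1], counts)]
        else:
            # Significant difference (or first interval): keep a boundary.
            merged.append(counts)
    return merged


if __name__ == "__main__":
    # Five intervals of a two-class problem, ordered by attribute value.
    example = [[5, 5], [4, 6], [0, 10], [1, 9], [10, 0]]
    print(merge_left_to_right(example))
```

Each interval is visited once and compared only with its left neighbour, so the pass is linear in the number of intervals, in contrast to ordering all candidate merges by test value. In a streaming setting such a pass could be invoked only periodically rather than after every example; how often to do so is left open here.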