Cut point analysis for the discretization of numerical attributes has shown that, for many commonly used attribute evaluation functions, adjacent value-range intervals with equal relative class distributions may be merged without losing the ability to find an optimal partition of the range. A natural idea is to relax this requirement and rely on a statistical test to decide whether adjacent intervals were probably generated by the same distribution. ChiMerge is a classical discretization algorithm that operates in exactly this manner: it carries out interval merges in the order of their statistical significance. However, the O(n log n) time this requires is too expensive for online processing of the data. In this paper we propose performing the merges during a single left-to-right scan of the intervals, which reduces the time requirement of merging to linear. Moreover, such a linear-time pass need not be executed for every incoming example. Our empirical evaluation shows that intervals are combined effectively, that their number grows only moderately even when the number of examples becomes very large, and that the substantial reduction in the number of intervals can even benefit prediction accuracy.
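The following Python sketch illustrates the kind of single left-to-right merging pass described above: adjacent intervals are merged whenever a chi-square test cannot distinguish their class distributions. The names (Interval, chi2_statistic, merge_scan) and the threshold parameter are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Sketch of a left-to-right interval-merging pass (illustrative, not the
# authors' code). Each interval stores per-class example counts; adjacent
# intervals whose class distributions are statistically indistinguishable
# are merged in a single linear scan.

from typing import List


class Interval:
    """A value-range interval with per-class example counts."""

    def __init__(self, low: float, high: float, class_counts: List[int]):
        self.low = low
        self.high = high
        self.class_counts = class_counts


def chi2_statistic(a: Interval, b: Interval) -> float:
    """Pearson chi-square statistic of the 2 x k contingency table
    formed by the class counts of two adjacent intervals."""
    k = len(a.class_counts)
    total = sum(a.class_counts) + sum(b.class_counts)
    chi2 = 0.0
    for j in range(k):
        col = a.class_counts[j] + b.class_counts[j]
        for counts in (a.class_counts, b.class_counts):
            expected = sum(counts) * col / total
            if expected > 0:
                chi2 += (counts[j] - expected) ** 2 / expected
    return chi2


def merge_scan(intervals: List[Interval], threshold: float) -> List[Interval]:
    """Single left-to-right pass: fold each interval into its left
    neighbor when the chi-square statistic falls below the threshold,
    i.e., when the test cannot separate their class distributions.
    Runs in time linear in the number of intervals."""
    merged: List[Interval] = []
    for iv in intervals:
        if merged and chi2_statistic(merged[-1], iv) < threshold:
            left = merged[-1]
            left.high = iv.high
            left.class_counts = [x + y for x, y in
                                 zip(left.class_counts, iv.class_counts)]
        else:
            merged.append(iv)
    return merged
```

In practice the threshold would be taken from the chi-square distribution at the chosen significance level with k - 1 degrees of freedom; for two classes and a 95% confidence level, for example, the critical value is about 3.84.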