We consider multisplitting of numerical value ranges, a task encountered both as a discretization step preceding induction and embedded within learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For the most commonly used evaluation functions this task takes time quadratic in the number of potential cut points in the numerical range, and hence is a potential bottleneck in data mining algorithms.

We present two techniques that speed up the optimal multisplitting task. The first aims to discard cut point candidates in a quick linear-time preprocessing scan before the actual search begins. We generalize Fayyad and Irani's definition of boundary points so that adjacent example blocks with the same relative class distribution can be merged. We prove for several commonly used evaluation functions that this preprocessing removes only suboptimal cut points; hence the algorithm does not lose optimality.

Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best known schema for optimizing many well-known evaluation functions. We present a technique that dynamically, i.e., during the search, prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions.

Together, these two techniques speed up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm, the speed-up is around 50 percent on average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.