Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates

Authors:
Tapio Elomaa;Juho Rousu
Affiliations:
Department of Computer Science, P.O. Box 26, FIN-00014, University of Helsinki, Finland. elomaa@cs.helsinki.fi;Department of Computer Science, P.O. Box 26, FIN-00014, University of Helsinki, Finland. rousu@cs.helsinki.fi
Venue:
Data Mining and Knowledge Discovery
Year:
2004

Citing 21

Decision trees and multi-valued attributes. Machine Intelligence 11.
A Distance-Based Attribute Selection Measure for Decision Tree Induction. Machine Learning.
Elements of information theory.
On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning.
C4.5: programs for machine learning.
Efficient agnostic PAC-learning with simple hypothesis. COLT '94 Proceedings of the seventh annual conference on Computational learning theory.
Noise modelling and evaluating learning from examples. Artificial Intelligence.
Technical note: some properties of splitting criteria. Machine Learning.
Efficient progressive sampling. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
General and Efficient Multisplitting of Numerical Attributes. Machine Learning.
Towards an effective cooperation of the user and the computer for classification. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.
Partitioning Nominal Attributes in Decision Trees. Data Mining and Knowledge Discovery.
Feature Selection via Discretization. IEEE Transactions on Knowledge and Data Engineering.
Use of Contextual Information for Feature Ranking and Discretization. IEEE Transactions on Knowledge and Data Engineering.
Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Incremental Induction of Decision Trees. Machine Learning.
Induction of Decision Trees. Machine Learning.
On Changing Continuous Attributes into Ordered Discrete Attributes. EWSL '91 Proceedings of the European Working Session on Machine Learning.
On Fast and Simple Algorithms for Finding Maximal Subarrays and Applications in Learning Theory. EuroCOLT '97 Proceedings of the Third European Conference on Computational Learning Theory.
Necessary and Sufficient Pre-processing in Numerical Range Discretization. Knowledge and Information Systems.
On the Computational Complexity of Optimal Multisplitting. Fundamenta Informaticae - Intelligent Systems.

Cited 7

Evaluating the performance of cost-based discretization versus entropy- and error-based discretization. Computers and Operations Research.
Improved Algorithms for Univariate Discretization of Continuous Features. PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases.
Obtaining low-arity discretizations from online data streams. ISMIS'08 Proceedings of the 17th international conference on Foundations of intelligent systems.
Maintaining optimal multi-way splits for numerical attributes in data streams. PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining.
Practical approximation of optimal multivariate discretization. ISMIS'06 Proceedings of the 16th international conference on Foundations of Intelligent Systems.
Approximation algorithms for minimizing empirical error by axis-parallel hyperplanes. ECML'05 Proceedings of the 16th European conference on Machine Learning.
A Theory of Evidence-based method for assessing frequent patterns. Expert Systems with Applications: An International Journal.

Abstract

We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For most commonly used evaluation functions this task takes quadratic time in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms.

We present two techniques that speed up the optimal multisplitting task. The first one aims at discarding cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani to allow us to merge adjacent example blocks that have the same relative class distribution. We prove for several commonly used evaluation functions that this processing removes only suboptimal cut points. Hence, the algorithm does not lose optimality.

Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best schema for optimizing many well-known evaluation functions. We present a technique that dynamically (i.e., during the search) prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions.

Together, these two techniques speed up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm, the speed-up is around 50 percent on average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.
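As a rough illustration of the two ingredients the abstract describes, the following Python sketch merges adjacent blocks with the same relative class distribution in one linear scan and then runs a plain quadratic-time dynamic program over the remaining cut points. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`bins`, `merge_segments`, `entropy_cost`, `optimal_multisplit`) are hypothetical, the evaluation function is assumed to be the size-weighted class entropy summed over intervals, the number of intervals `k` is fixed in advance, and the paper's dynamic pruning of prefix partitions is not reproduced.

```python
from collections import Counter
from math import log2


def bins(values, labels):
    """Class counts per distinct attribute value, in increasing value order."""
    grouped = {}
    for v, y in zip(values, labels):
        grouped.setdefault(v, Counter())[y] += 1
    return [grouped[v] for v in sorted(grouped)]


def same_distribution(a, b):
    """True if two blocks have identical relative class frequencies."""
    na, nb = sum(a.values()), sum(b.values())
    # Cross-multiplication avoids floating-point comparison of ratios.
    return all(a[c] * nb == b[c] * na for c in set(a) | set(b))


def merge_segments(blocks):
    """One linear scan: merge adjacent blocks with equal relative class distribution."""
    merged = []
    for block in blocks:
        if merged and same_distribution(merged[-1], block):
            merged[-1] = merged[-1] + block
        else:
            merged.append(Counter(block))
    return merged


def entropy_cost(counts):
    """Assumed evaluation function: interval size times its class entropy (in bits)."""
    n = sum(counts.values())
    return -sum(c * log2(c / n) for c in counts.values() if c)


def optimal_multisplit(blocks, k, cost=entropy_cost):
    """Baseline DP: split blocks into exactly k intervals with minimum total cost.

    Returns (total cost, sorted indices at which a new interval starts).
    """
    n = len(blocks)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of blocks")
    prefix = [Counter()]  # prefix[i] holds class counts of blocks[:i]
    for b in blocks:
        prefix.append(prefix[-1] + b)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for parts in range(1, k + 1):
        for i in range(parts, n + 1):
            for j in range(parts - 1, i):  # last interval is blocks[j:i]
                cand = best[parts - 1][j] + cost(prefix[i] - prefix[j])
                if cand < best[parts][i]:
                    best[parts][i], back[parts][i] = cand, j
    cuts, i = [], n
    for parts in range(k, 0, -1):  # follow back-pointers from the full range
        i = back[parts][i]
        cuts.append(i)
    return best[k][n], sorted(cuts)[1:]  # drop the leading 0


values = [1, 1, 2, 2, 2, 2, 3, 3, 4, 5]
labels = ["a", "b", "a", "a", "b", "b", "b", "b", "a", "a"]
raw = bins(values, labels)
segments = merge_segments(raw)
print(len(raw), len(segments))
print(optimal_multisplit(segments, 3))
```

In this toy data set, values 1 and 2 both have a 1:1 class ratio, so they are merged even though neither block is class-uniform; this is the sense in which the merging step goes beyond uniform-class blocks. The dynamic program then searches over fewer candidate cut points, and for this example it finds the same optimal cost on the merged segments as on the unmerged bins.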