A Scalable Algorithm for Rule Post-pruning of Large Decision Trees

Authors:
Trong Dung Nguyen; Tu Bao Ho; Hiroshi Shimodaira
Venue:
PAKDD '01 Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Year:
2001

Citing 7

Boolean Feature Discovery in Empirical Learning. Machine Learning
C4.5: programs for machine learning
Pruning Algorithms for Rule Learning. Machine Learning
Machine Learning
Learning Decision Lists. Machine Learning
The CN2 Induction Algorithm. Machine Learning
Generating Accurate Rule Sets Without Global Optimization. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning

Cited 1

Data and Knowledge Visualization in Knowledge Discovery Process. VISUAL '02 Proceedings of the 5th International Conference on Recent Advances in Visual Information Systems

Abstract

Decision tree learning has become a popular and practical method in data mining because of its high predictive accuracy and ease of use. However, a set of if-then rules generated from a large tree may be preferred in many cases, for at least three reasons: (i) large decision trees are difficult to understand, because we may lose sight of their hierarchical structure or get lost while navigating them; (ii) the tree structure may cause individual subconcepts to be fragmented (sometimes known as the "replicated subtree" problem); (iii) it is easier to combine newly discovered rules with existing knowledge in a given domain. To meet that need, the popular decision tree learning system C4.5 applies a rule post-pruning algorithm to transform a decision tree into a rule set. However, because it uses a global optimization strategy, C4.5rules runs extremely slowly on large datasets. On the other hand, rule post-pruning algorithms that learn a set of rules by the separate-and-conquer strategy, such as CN2, IREP, or RIPPER, can scale to large datasets, but they suffer from the crucial problem of overpruning and often do not achieve accuracy as high as that of C4.5. This paper proposes a scalable algorithm for rule post-pruning of large decision trees that employs incremental pruning with improvements in order to overcome the overpruning problem. Experiments show that the new algorithm produces rule sets that are as accurate as those generated by C4.5 and that it scales to large datasets.
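
To make the idea of rule post-pruning concrete, the following is a minimal sketch of the general technique: every root-to-leaf path of a tree becomes an if-then rule, and conditions are then dropped greedily as long as the rule's accuracy on held-out data does not fall. It is not the algorithm proposed in the paper, nor the procedure used by C4.5rules. The tree encoding, the function names (`tree_to_rules`, `rule_accuracy`, `prune_rule`), the toy data, and the plain accuracy criterion are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

Condition = Tuple[str, str]            # (attribute, required value)
Rule = Tuple[List[Condition], str]     # (conditions, predicted class)
Example = Tuple[Dict[str, str], str]   # (attribute values, true class)
# Hypothetical encoding: a tree is a class label (leaf)
# or a pair (attribute, {value: subtree}).
Tree = Union[str, Tuple[str, dict]]


def tree_to_rules(tree: Tree, path: Tuple[Condition, ...] = ()) -> List[Rule]:
    """Turn every root-to-leaf path into one if-then rule."""
    if isinstance(tree, str):
        return [(list(path), tree)]
    attribute, branches = tree
    rules: List[Rule] = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules


def rule_accuracy(conditions: List[Condition], label: str,
                  data: List[Example]) -> float:
    """Accuracy of a rule on the examples it covers (0.0 if it covers none)."""
    covered = [cls for attrs, cls in data
               if all(attrs.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(cls == label for cls in covered) / len(covered)


def prune_rule(rule: Rule, data: List[Example]) -> Rule:
    """Greedily drop conditions while accuracy on `data` does not decrease."""
    conditions, label = list(rule[0]), rule[1]
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, data)
        for i in range(len(conditions)):
            candidate = conditions[:i] + conditions[i + 1:]
            # Ties are accepted, so shorter rules are preferred.
            if rule_accuracy(candidate, label, data) >= base:
                conditions = candidate
                improved = True
                break
    return conditions, label


# Toy tree and held-out data (both invented for this sketch).
tree: Tree = ("outlook", {
    "sunny": ("windy", {"yes": "stay", "no": "play"}),
    "rainy": "stay",
})
data: List[Example] = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]

rules = tree_to_rules(tree)
print(prune_rule(rules[1], data))  # ([('windy', 'no')], 'play')
```

In this toy case the held-out data shows that `outlook` adds nothing to the "play" rule, so that condition is removed. A practical system would use a more careful quality estimate than raw accuracy, and would also decide which pruned rules to keep and in what order.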