Discrete decision tree induction to avoid overfitting on categorical data

  • Authors:
  • Nittaya Kerdprasop; Kittisak Kerdprasop

  • Affiliations:
  • Data Engineering Research Unit, School of Computer Engineering, Suranaree University of Technology, Nakhon Ratchasima, Thailand (both authors)

  • Venue:
  • MAMECTIS/NOLASC/CONTROL/WAMUS'11 Proceedings of the 13th WSEAS international conference on mathematical methods, computational techniques and intelligent systems, and 10th WSEAS international conference on non-linear analysis, non-linear systems and chaos, and 7th WSEAS international conference on dynamical systems and control, and 11th WSEAS international conference on Wavelet analysis and multirate systems: recent researches in computational techniques, non-linear systems and control
  • Year:
  • 2011

Abstract

A decision tree is a hierarchical structure commonly used to visualize steps in a decision-making process. Decision tree induction is a data mining method that builds a decision tree from archival data with the intention of obtaining a decision model applicable to future cases. The advantages of decision tree induction over other data mining techniques are its simple structure, ease of comprehension, and ability to handle both numerical and categorical data. For numerical data with continuous values, the tree-building algorithm simply compares the values to some constant: if the attribute value is smaller than or equal to the constant, the algorithm proceeds to the left branch; otherwise, it takes the right branch. The branching process is much more complex for categorical data. The algorithm has to calculate the optimal branching decision based on the proportion of each individual value of a categorical attribute with respect to the target attribute. A categorical attribute with many distinct values can lead to the overfitting problem. Overfitting occurs when a model becomes overly complex in the attempt to describe too many small samples, which result from categorical attributes with large numbers of distinct values. A model that overfits the training data has poor predictive performance on unseen test data. We thus propose novel techniques based on data grouping and heuristic-based selection to deal with the overfitting problem on categorical data. Our intuition rests on the appropriate selection of data samples to remove random error or noise before building the model. Heuristics play their role in the pruning strategy during the model-building phase. The implementation of our proposed method is based on the logic programming paradigm, and some major functions are presented in the paper. We observe from the experimental results that our techniques work well on high-dimensional categorical data in which attributes contain fewer than ten distinct values. For larger numbers of categorical values, a discretization technique is necessary.
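The overfitting mechanism the abstract describes can be illustrated with a minimal sketch (in Python rather than the logic programming paradigm the paper actually uses; the data and function names are illustrative, not from the paper). Splitting on a high-cardinality categorical attribute, such as a record ID, yields an inflated information gain even when the attribute carries no signal, because every branch ends up holding a tiny, trivially pure sample:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain from branching on each distinct value of a categorical attribute."""
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: the class labels are pure noise with respect to both attributes.
labels       = ['yes', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'no']
binary_attr  = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']  # 2 distinct values
id_like_attr = list(range(8))                             # 8 distinct values

g_binary = information_gain(binary_attr, labels)  # 0.0 -- no information
g_id     = information_gain(id_like_attr, labels) # 1.0 -- each branch holds
                                                  # one sample, so it looks
                                                  # like a perfect split
print(g_binary, g_id)
```

The ID-like attribute appears to be the best split purely because of its cardinality, which is the bias that the paper's data grouping, heuristic selection, and discretization techniques are meant to counteract.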