Discrete decision tree induction to avoid overfitting on categorical data

  • Authors:
  • Nittaya Kerdprasop; Kittisak Kerdprasop

  • Affiliations:
  • Data Engineering Research Unit, School of Computer Engineering, Suranaree University of Technology, Nakhon Ratchasima, Thailand (both authors)

  • Venue:
  • MAMECTIS/NOLASC/CONTROL/WAMUS'11 Proceedings of the 13th WSEAS international conference on mathematical methods, computational techniques and intelligent systems, and 10th WSEAS international conference on non-linear analysis, non-linear systems and chaos, and 7th WSEAS international conference on dynamical systems and control, and 11th WSEAS international conference on Wavelet analysis and multirate systems: recent researches in computational techniques, non-linear systems and control
  • Year:
  • 2011

Abstract

A decision tree is a hierarchical structure commonly used to visualize steps in a decision-making process. Decision tree induction is a data mining method that builds a decision tree from archival data with the intention of obtaining a decision model applicable to future cases. The advantages of decision tree induction over other data mining techniques are its simple structure, ease of comprehension, and ability to handle both numerical and categorical data. For numerical data with continuous values, the tree-building algorithm simply compares the values to some constant: if the attribute value is smaller than or equal to the constant, the algorithm proceeds to the left branch; otherwise, it takes the right branch. The branching process is much more complex for categorical data. The algorithm has to calculate the optimal branching decision based on the proportion of each individual value of a categorical attribute with respect to the target attribute. A categorical attribute with many distinct values can lead to the overfitting problem. Overfitting occurs when a model becomes overly complex in the attempt to describe too many small samples, which result from categorical attributes with large numbers of distinct values. A model that overfits the training data has poor predictive performance on unseen test data. We thus propose novel techniques based on data grouping and heuristic-based selection to deal with the overfitting problem on categorical data. Our intuition rests on the appropriate selection of data samples to remove random error or noise before building the model. Heuristics play their role in the pruning strategy during the model-building phase. The implementation of our proposed method is based on the logic programming paradigm, and some major functions are presented in the paper. We observe from the experimental results that our techniques work well on high-dimensional categorical data in which attributes contain fewer than ten distinct values. For larger numbers of categorical values, a discretization technique is necessary.
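The overfitting mechanism the abstract describes can be illustrated with a minimal sketch (in Python rather than the logic programming paradigm the paper actually uses; the data and function names are illustrative, not from the paper). Splitting on a high-cardinality categorical attribute, such as a record ID, yields an inflated information gain even when the attribute carries no signal, because every branch ends up holding a tiny, trivially pure sample:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain from branching on each distinct value of a categorical attribute."""
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: the class labels are pure noise with respect to both attributes.
labels       = ['yes', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'no']
binary_attr  = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']  # 2 distinct values
id_like_attr = list(range(8))                             # 8 distinct values

g_binary = information_gain(binary_attr, labels)  # 0.0 -- no information
g_id     = information_gain(id_like_attr, labels) # 1.0 -- each branch holds
                                                  # one sample, so it looks
                                                  # like a perfect split
print(g_binary, g_id)
```

The ID-like attribute appears to be the best split purely because of its cardinality, which is the bias that the paper's data grouping, heuristic selection, and discretization techniques are meant to counteract.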