Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size

  • Authors:
  • Yong Soo Kim

  • Affiliations:
  • CI Division, SK telecom, 11, Euljiro 2-ga, Jung-gu, Seoul, 100-999, Republic of Korea

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2008

Quantified Score

Hi-index 12.06

Abstract

In this article, the performance of data mining and statistical techniques was compared empirically while varying the number of independent variables, the types of independent variables, the number of classes of the categorical independent variables, and the sample size. The study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques and linear regression as the statistical method. Using the root mean squared error (RMSE) as the performance metric, the study reports the following findings: (i) for continuous independent variables, the statistical technique (i.e., linear regression) was superior to the data mining techniques (i.e., decision tree and artificial neural network) regardless of the number of variables and the sample size; (ii) for a mix of continuous and categorical independent variables, linear regression was best when there was one categorical variable, while the artificial neural network was superior when there were two or more; (iii) the performance of the artificial neural network improved faster than that of the other methods as the number of classes of the categorical variables increased.
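The RMSE-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the study's experimental setup: the data-generating model (`y = 2 + 3x` plus Gaussian noise), the sample sizes, the seed, and the helper names `rmse`, `fit_line`, and `fit_stump` are all assumptions chosen for illustration. A depth-1 regression stump stands in for a full decision tree, and no artificial neural network is included.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def fit_line(xs, ys):
    """Closed-form least squares for y = b0 + b1 * x (one predictor)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1


def fit_stump(xs, ys):
    """Depth-1 regression tree: one split, mean prediction on each side.

    Hypothetical stand-in for a decision tree, not the paper's method.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, ml, mr)
    return best[1:]


# Assumed simulation: one continuous predictor with a linear signal.
rng = random.Random(0)


def simulate(n):
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


train_x, train_y = simulate(200)
test_x, test_y = simulate(100)

b0, b1 = fit_line(train_x, train_y)
threshold, mean_left, mean_right = fit_stump(train_x, train_y)

# Score both models on the held-out sample with the same metric.
rmse_linear = rmse(test_y, [b0 + b1 * x for x in test_x])
rmse_stump = rmse(test_y, [mean_left if x <= threshold else mean_right for x in test_x])

print(f"linear regression RMSE: {rmse_linear:.3f}")
print(f"regression stump RMSE:  {rmse_stump:.3f}")
```

Because the simulated signal here is linear in a continuous predictor, the linear fit should score a lower RMSE than the stump. That agrees in direction with finding (i), but it says nothing about the study's own 60 examples.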