A general framework for accurate and fast regression by data summarization in random decision trees

  • Authors:
  • Wei Fan; Joe McCloskey; Philip S. Yu

  • Affiliations:
  • IBM T. J. Watson Research, Hawthorne, NY; US Department of Defense, Ft. Meade, MD; IBM T. J. Watson Research, Hawthorne, NY

  • Venue:
  • Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2006

Abstract

Predicting the value of a continuous variable as a function of several independent variables is one of the most important problems in data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past. However, since the list is quite extensive and many of these models make explicit, strong, yet different assumptions about the type of applicable problems and involve many parameters and options, choosing the appropriate regression methodology and then specifying the parameter values is a non-trivial, sometimes frustrating, task for data mining practitioners. Choosing an inappropriate methodology can produce rather disappointing results, and this issue undermines the general utility of data mining software. For example, linear regression methods are straightforward and well understood, but since the linearity assumption is very strong, their performance is compromised on complicated non-linear problems. Kernel-based methods perform quite well provided the kernel functions are selected correctly. In this paper, we propose a straightforward approach based on summarizing the training data with an ensemble of random decision trees. It requires very little knowledge from the user, yet is applicable to every type of regression problem we are currently aware of. We have experimented on a wide range of problems, including those on which parametric methods perform well, a large selection of benchmark datasets for nonparametric regression, and highly non-linear stochastic problems. Our results are either significantly better than or identical to those of many approaches known to perform well on these problems.
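The abstract describes the method only at a high level. Below is a minimal, illustrative Python sketch of the general idea of regression by data summarization in random decision trees: tree structure is drawn at random, independent of the labels; training targets are then summarized at the leaves (here by their mean); and an ensemble of such trees is averaged at prediction time. All names (build_random_tree, summarize, fit_predict) and specific choices (uniform random thresholds, fixed depth, leaf means, fallback to the global mean) are assumptions made for illustration, not details taken from the paper.

```python
import random
import statistics

class Node:
    def __init__(self):
        self.feature = None    # splitting feature index; None marks a leaf
        self.threshold = None  # randomly chosen split point
        self.left = None
        self.right = None
        self.values = []       # training targets summarized at a leaf

def build_random_tree(bounds, depth, max_depth):
    """Grow a tree whose structure is chosen at random,
    without ever consulting the training labels."""
    node = Node()
    if depth >= max_depth:
        return node
    node.feature = random.randrange(len(bounds))
    lo, hi = bounds[node.feature]
    node.threshold = random.uniform(lo, hi)
    node.left = build_random_tree(bounds, depth + 1, max_depth)
    node.right = build_random_tree(bounds, depth + 1, max_depth)
    return node

def summarize(node, x, y):
    """Route one training example to its leaf and record the target."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    node.values.append(y)

def predict_tree(node, x):
    """Return the leaf's summary (mean target), or None if the leaf is empty."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return statistics.mean(node.values) if node.values else None

def fit_predict(X, y, x_query, n_trees=10, max_depth=5):
    """Build the ensemble, summarize the data into it, and average predictions."""
    bounds = [(min(col), max(col)) for col in zip(*X)]
    trees = [build_random_tree(bounds, 0, max_depth) for _ in range(n_trees)]
    for t in trees:
        for xi, yi in zip(X, y):
            summarize(t, xi, yi)
    preds = [p for p in (predict_tree(t, x_query) for t in trees) if p is not None]
    return statistics.mean(preds) if preds else statistics.mean(y)

# Toy usage on a simple non-linear target.
X = [[i / 10] for i in range(100)]
y = [xi[0] ** 2 for xi in X]
print(fit_predict(X, y, [0.55], n_trees=20, max_depth=6))
```

Consistent with the abstract's claim that the approach requires very little from the user, this sketch exposes only an ensemble size and a depth, and both defaults here are arbitrary illustrative values rather than settings reported in the paper.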