A general framework for accurate and fast regression by data summarization in random decision trees

  • Authors:
  • Wei Fan; Joe McCloskey; Philip S. Yu

  • Affiliations:
  • IBM T. J. Watson Research, Hawthorne, NY; US Department of Defense, Ft. Meade, MD; IBM T. J. Watson Research, Hawthorne, NY

  • Venue:
  • Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2006

Abstract

Predicting the value of a continuous variable as a function of several independent variables is one of the most important problems in data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past. However, since the list is quite extensive and many of these models make explicit, strong, yet different assumptions about the type of applicable problems and involve many parameters and options, choosing the appropriate regression methodology and then specifying the parameter values is a non-trivial, sometimes frustrating, task for data mining practitioners. Choosing an inappropriate methodology can produce rather disappointing results, and this issue undermines the general utility of data mining software. For example, linear regression methods are straightforward and well understood, but since the linearity assumption is very strong, their performance is compromised on complicated non-linear problems. Kernel-based methods perform quite well provided the kernel functions are selected correctly. In this paper, we propose a straightforward approach based on summarizing the training data with an ensemble of random decision trees. It requires very little knowledge from the user, yet is applicable to every type of regression problem we are currently aware of. We have experimented on a wide range of problems, including those on which parametric methods perform well, a large selection of benchmark datasets for nonparametric regression, and highly non-linear stochastic problems. Our results are either significantly better than or identical to those of many approaches known to perform well on these problems.
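The abstract describes the method only at a high level. Below is a minimal, illustrative Python sketch of the general idea of regression by data summarization in random decision trees: tree structure is drawn at random, independent of the labels; training targets are then summarized at the leaves (here by their mean); and an ensemble of such trees is averaged at prediction time. All names (build_random_tree, summarize, fit_predict) and specific choices (uniform random thresholds, fixed depth, leaf means, fallback to the global mean) are assumptions made for illustration, not details taken from the paper.

```python
import random
import statistics

class Node:
    def __init__(self):
        self.feature = None    # splitting feature index; None marks a leaf
        self.threshold = None  # randomly chosen split point
        self.left = None
        self.right = None
        self.values = []       # training targets summarized at a leaf

def build_random_tree(bounds, depth, max_depth):
    """Grow a tree whose structure is chosen at random,
    without ever consulting the training labels."""
    node = Node()
    if depth >= max_depth:
        return node
    node.feature = random.randrange(len(bounds))
    lo, hi = bounds[node.feature]
    node.threshold = random.uniform(lo, hi)
    node.left = build_random_tree(bounds, depth + 1, max_depth)
    node.right = build_random_tree(bounds, depth + 1, max_depth)
    return node

def summarize(node, x, y):
    """Route one training example to its leaf and record the target."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    node.values.append(y)

def predict_tree(node, x):
    """Return the leaf's summary (mean target), or None if the leaf is empty."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return statistics.mean(node.values) if node.values else None

def fit_predict(X, y, x_query, n_trees=10, max_depth=5):
    """Build the ensemble, summarize the data into it, and average predictions."""
    bounds = [(min(col), max(col)) for col in zip(*X)]
    trees = [build_random_tree(bounds, 0, max_depth) for _ in range(n_trees)]
    for t in trees:
        for xi, yi in zip(X, y):
            summarize(t, xi, yi)
    preds = [p for p in (predict_tree(t, x_query) for t in trees) if p is not None]
    return statistics.mean(preds) if preds else statistics.mean(y)

# Toy usage on a simple non-linear target.
X = [[i / 10] for i in range(100)]
y = [xi[0] ** 2 for xi in X]
print(fit_predict(X, y, [0.55], n_trees=20, max_depth=6))
```

Consistent with the abstract's claim that the approach requires very little from the user, this sketch exposes only an ensemble size and a depth, and both defaults here are arbitrary illustrative values rather than settings reported in the paper.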