A finite-sample simulation study of cross validation in tree-based models

Authors:
Seoung Bum Kim;Xiaoming Huo;Kwok-Leung Tsui
Affiliations:
Department of Industrial Systems and Information Engineering, Korea University, Seoul, Korea;Department of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA 30332;Department of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA 30332
Venue:
Information Technology and Management
Year:
2009

Abstract

Cross validation (CV) is widely used for selecting and evaluating statistical models. This study explores the behavior of CV in tree-based models through an experimental approach that compares a cross-validated tree classifier with the Bayes classifier, which is optimal for the underlying distribution. The main observation is that the differences between the testing and training errors of the cross-validated tree classifier and of the Bayes classifier empirically follow a linear regression relation. The slope and the coefficient of determination of the regression model can serve as performance measures of a cross-validated tree classifier. Moreover, simulation reveals that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distributions and on the sample size. Our study can explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
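The kind of experiment the abstract describes can be illustrated with a small simulation. The sketch below is not the authors' procedure; it is one possible reading under stated assumptions: two one-dimensional Gaussian classes with equal priors and unit variance (so the Bayes rule is a threshold at 0), a toy tree grown by greedy misclassification splits, tree depth chosen by 5-fold CV, and an ordinary least-squares fit of the tree's (testing minus training) error gap on the Bayes classifier's gap. All function names and parameter values (`sample`, `fit_tree`, `cv_depth`, `experiment`, `mu`, `reps`, and so on) are hypothetical.

```python
import random

def sample(n, mu, rng):
    """Draw n points (x, y): y in {0, 1} with equal priors, x ~ N(+mu, 1) if y else N(-mu, 1)."""
    labels = [rng.randrange(2) for _ in range(n)]
    return [(rng.gauss(mu if y else -mu, 1.0), y) for y in labels]

def bayes(x):
    """Bayes rule for this symmetric, equal-variance setup (an assumption): threshold at 0."""
    return 1 if x > 0 else 0

def best_split(data):
    """Return the threshold that most reduces misclassifications, or None if no split helps."""
    data = sorted(data)
    n, total1 = len(data), sum(y for _, y in data)
    best_err, best_t, left1 = min(total1, n - total1), None, 0
    for i in range(1, n):
        left1 += data[i - 1][1]
        if data[i - 1][0] == data[i][0]:
            continue  # cannot split between equal x values
        right1 = total1 - left1
        err = min(left1, i - left1) + min(right1, n - i - right1)
        if err < best_err:
            best_err, best_t = err, (data[i - 1][0] + data[i][0]) / 2
    return best_t

def fit_tree(data, depth):
    """Toy 1-D tree: a leaf is a label (int), an internal node is (threshold, left, right)."""
    ones = sum(y for _, y in data)
    label = 1 if 2 * ones > len(data) else 0
    t = best_split(data) if depth > 0 else None
    if t is None:
        return label
    left = [d for d in data if d[0] <= t]
    right = [d for d in data if d[0] > t]
    return (t, fit_tree(left, depth - 1), fit_tree(right, depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def error(predict_fn, data):
    """Misclassification rate of predict_fn on data."""
    return sum(predict_fn(x) != y for x, y in data) / len(data)

def cv_depth(data, depths, k):
    """Pick the depth with the lowest k-fold CV error (first minimum wins on ties)."""
    folds = [data[i::k] for i in range(k)]
    def cv_err(d):
        wrong = 0
        for i in range(k):
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            tree = fit_tree(train, d)
            wrong += sum(predict(tree, x) != y for x, y in folds[i])
        return wrong / len(data)
    return min(depths, key=cv_err)

def ols(xs, ys):
    """Slope and coefficient of determination of the least-squares line of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

def experiment(reps=40, n_train=60, n_test=500, mu=1.0, seed=0):
    """Regress the tree's (test - train) error gap on the Bayes classifier's gap."""
    rng = random.Random(seed)
    gaps_tree, gaps_bayes = [], []
    for _ in range(reps):
        train, test = sample(n_train, mu, rng), sample(n_test, mu, rng)
        tree = fit_tree(train, cv_depth(train, range(0, 4), 5))
        tree_fn = lambda x: predict(tree, x)
        gaps_tree.append(error(tree_fn, test) - error(tree_fn, train))
        gaps_bayes.append(error(bayes, test) - error(bayes, train))
    return ols(gaps_bayes, gaps_tree)

if __name__ == "__main__":
    slope, r2 = experiment()
    print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

Varying `mu` (class separation) and `n_train` in this sketch is one way to probe how the slope and coefficient of determination change with the distribution parameters and the sample size; the actual distributions, tree algorithm, and regression setup used in the paper may differ.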