Computational Intelligence (CI) provides robust working solutions for global optimization and is especially suited to parameter optimization tasks with noisy fitness functions. Such fitness landscapes frequently arise in real-world applications such as Data Mining (DM). Unfortunately, parameter tuning in DM is computationally expensive, and CI-based methods often require many function evaluations before they converge to good solutions. Earlier studies have shown that surrogate models can reduce the number of real function evaluations; however, each remaining function evaluation is still time-consuming. In this paper we investigate whether and how the fitness landscape of the parameter space changes when fewer observations are used for model training during tuning. A representative study on seven DM tasks shows that the results remain competitive: on all tasks, a fraction of 10-15% of the training data is sufficient, reducing computation time by a factor of 6-10.
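The core idea can be sketched in a few lines: run the (expensive) hyperparameter search on only a small fraction of the training data, then train the final model on the full set. The sketch below is a minimal, hypothetical illustration of that workflow; the toy 1-nearest-neighbour-style classifier, the synthetic data, and all function names are assumptions for demonstration, not the authors' actual models, tuners, or benchmark tasks.

```python
import random

def make_data(n, seed=0):
    # Toy noisy 1-D classification task: label is 1 if x (plus noise) > 0.5.
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [1 if x + rng.gauss(0, 0.1) > 0.5 else 0 for x in xs]
    return list(zip(xs, ys))

def knn_error(train, test, k):
    # Misclassification rate of a simple k-NN majority vote on `test`.
    errors = 0
    for x, y in test:
        neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        vote = 1 if 2 * sum(p[1] for p in neighbours) > k else 0
        errors += vote != y
    return errors / len(test)

def tune_k(train, candidates, holdout_frac=0.3, seed=1):
    # Pick the k with the lowest error on an internal holdout split.
    rng = random.Random(seed)
    data = train[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - holdout_frac))
    fit, hold = data[:cut], data[cut:]
    return min(candidates, key=lambda k: knn_error(fit, hold, k))

full_train = make_data(500)
test_set = make_data(200, seed=42)

# Tune on only 15% of the training data (the fraction found sufficient
# in the study above), so each tuning evaluation is much cheaper.
subsample = random.Random(2).sample(full_train, int(0.15 * len(full_train)))
candidates = [1, 3, 5, 9, 15]
best_k = tune_k(subsample, candidates)

# The final model still uses ALL training data; only tuning was subsampled.
final_error = knn_error(full_train, test_set, best_k)
```

Because each tuning evaluation trains on roughly one seventh of the data, the total tuning cost drops accordingly, which mirrors the 6-10x speed-up reported for the seven DM tasks.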