We compare Bayes model averaging (BMA) to a non-Bayes form of model averaging called stacking. In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. When the correct data-generating model (DGM) is on the list of models under consideration, BMA is never worse than stacking and is often demonstrably better, provided that the noise level is of an order commensurate with the coefficients and explanatory variables. Here, however, we focus on the case where the correct DGM is not on the model list and may not be well approximated by the elements of the model list. We give a sequence of computed examples, choosing model lists and DGMs to contrast the risk performance of stacking and BMA. In the first examples, the model lists are chosen to reflect geometric principles that should give good performance. In these cases, stacking typically outperforms BMA, sometimes by a wide margin. In the second set of examples, we examine how stacking and BMA perform when the model list includes all subsets of a set of potential predictors. When we standardize the size of terms and coefficients in this setting, we find that BMA outperforms stacking when the deviant terms in the DGM 'point' in directions accommodated by the model list, but that when the deviant term points outside the model list, stacking seems to do better. Overall, our results suggest that stacking has better robustness properties than BMA in the most important settings.
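The cross-validation mechanism behind stacking weights can be illustrated with a minimal sketch. Assumptions not taken from the paper: the candidate models are ordinary-least-squares fits on subsets of the predictors, and the stacking weights are obtained by regressing the response on the cross-validated predictions, then clipped to be non-negative and renormalized (one common convention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the true DGM uses both predictors; each candidate
# model below uses only a subset of them.
n = 200
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Candidate models: OLS on subsets of the columns.
subsets = [[0], [1], [0, 1]]

def ols_fit(Xs, ys):
    # Least-squares coefficients for a candidate design matrix.
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

# K-fold cross-validated predictions for each candidate model.
K = 5
folds = np.array_split(rng.permutation(n), K)
cv_pred = np.zeros((n, len(subsets)))
for j, cols in enumerate(subsets):
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = ols_fit(X[np.ix_(train, cols)], y[train])
        cv_pred[test, j] = X[np.ix_(test, cols)] @ beta

# Stacking weights: regress y on the held-out predictions, then
# clip to non-negative and renormalize (an assumed convention here,
# not the paper's exact estimator).
w = np.linalg.lstsq(cv_pred, y, rcond=None)[0]
w = np.clip(w, 0, None)
w = w / w.sum()
print(w)
```

Because the weights come from held-out predictions rather than posterior model probabilities, a misspecified model list is penalized only through its predictive performance, which is the robustness property the abstract highlights.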