We compare Bayes model averaging (BMA) to a non-Bayes form of model averaging called stacking. In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. When the correct data-generating model (DGM) is on the list of models under consideration, BMA is never worse than stacking and is often demonstrably better, provided that the noise level is of an order commensurate with the coefficients and explanatory variables. Here, however, we focus on the case where the correct DGM is not on the model list and may not be well approximated by the elements of the model list. We give a sequence of computed examples, choosing model lists and DGMs to contrast the risk performance of stacking and BMA. In the first examples, the model lists are chosen to reflect geometric principles that should give good performance. In these cases, stacking typically outperforms BMA, sometimes by a wide margin. In the second set of examples, we examine how stacking and BMA perform when the model list includes all subsets of a set of potential predictors. When we standardize the size of terms and coefficients in this setting, we find that BMA outperforms stacking when the deviant terms in the DGM 'point' in directions accommodated by the model list, but that when the deviant term points outside the model list, stacking seems to do better. Overall, our results suggest that stacking has better robustness properties than BMA in the most important settings.
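The cross-validation mechanism behind stacking weights can be illustrated with a minimal sketch. Assumptions not taken from the paper: the candidate models are ordinary-least-squares fits on subsets of the predictors, and the stacking weights are obtained by regressing the response on the cross-validated predictions, then clipped to be non-negative and renormalized (one common convention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the true DGM uses both predictors; each candidate
# model below uses only a subset of them.
n = 200
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Candidate models: OLS on subsets of the columns.
subsets = [[0], [1], [0, 1]]

def ols_fit(Xs, ys):
    # Least-squares coefficients for a candidate design matrix.
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

# K-fold cross-validated predictions for each candidate model.
K = 5
folds = np.array_split(rng.permutation(n), K)
cv_pred = np.zeros((n, len(subsets)))
for j, cols in enumerate(subsets):
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = ols_fit(X[np.ix_(train, cols)], y[train])
        cv_pred[test, j] = X[np.ix_(test, cols)] @ beta

# Stacking weights: regress y on the held-out predictions, then
# clip to non-negative and renormalize (an assumed convention here,
# not the paper's exact estimator).
w = np.linalg.lstsq(cv_pred, y, rcond=None)[0]
w = np.clip(w, 0, None)
w = w / w.sum()
print(w)
```

Because the weights come from held-out predictions rather than posterior model probabilities, a misspecified model list is penalized only through its predictive performance, which is the robustness property the abstract highlights.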