Learning curves for Gaussian process (GP) regression can be strongly affected by a mismatch between the ‘student’ model and the ‘teacher’ (the true data-generating process), exhibiting, e.g., multiple overfitting maxima and logarithmically slow learning. I investigate whether GPs can be made robust against such effects by adapting student model hyperparameters to maximize the evidence (data likelihood). An approximation for the average evidence is derived and used to predict the optimal hyperparameter values and the resulting generalization error. For large input space dimension, where the approximation becomes exact, Bayes-optimal performance is obtained at the evidence maximum, but the actual hyperparameters (e.g. the noise level) do not necessarily reflect the properties of the teacher. Also, the theoretically achievable evidence maximum cannot always be reached with the chosen set of hyperparameters, and maximizing the evidence in such cases can actually make generalization performance worse rather than better. In lower-dimensional learning scenarios, the theory predicts, in excellent qualitative and good quantitative accord with simulations, that evidence maximization eliminates logarithmically slow learning and recovers the optimal scaling of the decrease of generalization error with training set size.
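The hyperparameter adaptation described above can be illustrated with a minimal sketch: for GP regression, the evidence (log marginal likelihood) is computed in closed form, and the student's noise-level hyperparameter is chosen to maximize it. The RBF kernel, the synthetic teacher (a noisy sine), and all numerical values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Squared-exponential (RBF) covariance between two sets of 1-D inputs.
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def log_evidence(y, K, noise_var):
    # Log marginal likelihood log p(y | X, hyperparameters) for GP regression:
    # -0.5 y^T (K + s^2 I)^{-1} y - 0.5 log|K + s^2 I| - (n/2) log(2 pi),
    # computed stably via a Cholesky factorization.
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# Synthetic teacher (an assumption for illustration): noisy sine observations.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 30)
true_noise_var = 0.1
y = np.sin(X) + rng.normal(0.0, np.sqrt(true_noise_var), 30)

# Student: fixed RBF kernel; adapt only the noise variance by maximizing
# the evidence over a grid of candidate values.
K = rbf_kernel(X, X)
noise_grid = np.logspace(-3, 1, 50)
ev = np.array([log_evidence(y, K, s2) for s2 in noise_grid])
best_noise = noise_grid[np.argmax(ev)]
print(f"evidence-maximizing noise variance: {best_noise:.3f}")
```

As the abstract notes, the evidence-maximizing noise level need not coincide with the teacher's true noise level; it is whatever value best trades off data fit against model complexity under the (possibly mismatched) student prior.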