Natural conjugate gradient training of multilayer perceptrons

Authors:
Ana González; José R. Dorronsoro
Affiliations:
Dpto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, Madrid, Spain; Dpto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, Madrid, Spain
Venue:
ICANN'06 Proceedings of the 16th international conference on Artificial Neural Networks - Volume Part I
Year:
2006

Abstract

For maximum log-likelihood estimation, the Fisher matrix defines a Riemannian metric in weight space and, as shown by Amari and his coworkers, the resulting natural gradient greatly accelerates on-line multilayer perceptron (MLP) training. While its batch gradient descent counterpart also improves on standard gradient descent (as it gives a Gauss–Newton approximation to mean square error minimization), it may no longer be competitive with more advanced gradient-based function minimization procedures. In this work we show how to introduce natural gradients in a conjugate gradient (CG) setting and show numerically that, when applied to batch MLP learning, they lead to faster convergence to better minima than those achieved by standard Euclidean CG descent. Since a drawback of the full natural gradient is its larger computational cost, we also consider some cost-reducing variants and show that one of them, diagonal natural CG, also gives better minima than standard CG at a comparable complexity.
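
The ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes a scalar-output model with unit-variance Gaussian noise, so that the Fisher matrix coincides with the Gauss–Newton matrix J^T J / N, and it uses the textbook preconditioned Polak–Ribière formula as one plausible way to combine natural gradients with CG. All function names (`fisher_and_gradient`, `natural_gradient`, `natural_cg_direction`) and the `damping` parameter are hypothetical.

```python
import numpy as np


def fisher_and_gradient(J, residuals):
    """Gradient of 0.5 * mean squared error and the Gauss-Newton matrix.

    J: (N, P) Jacobian of the model outputs w.r.t. the P weights.
    residuals: (N,) model outputs minus targets.
    Assumption: unit-variance Gaussian noise, so J^T J / N plays the
    role of the Fisher matrix.
    """
    n = J.shape[0]
    grad = J.T @ residuals / n
    fisher = J.T @ J / n
    return fisher, grad


def natural_gradient(fisher, grad, damping=1e-8, diagonal=False):
    """Fisher-preconditioned gradient; diagonal=True keeps only diag(F)."""
    if diagonal:
        # Cheap variant: O(P) instead of solving a P x P system.
        return grad / (np.diag(fisher) + damping)
    return np.linalg.solve(fisher + damping * np.eye(grad.size), grad)


def natural_cg_direction(grad, nat_grad, prev=None):
    """Preconditioned Polak-Ribiere search direction (illustrative choice).

    prev is None on the first iteration, otherwise a tuple
    (previous grad, previous natural grad, previous direction).
    """
    if prev is None:
        return -nat_grad
    g_old, ng_old, d_old = prev
    beta = max(0.0, grad @ (nat_grad - ng_old) / (g_old @ ng_old))
    return -nat_grad + beta * d_old


# Toy check on a linear model, where the Jacobian is simply X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free targets

w0 = np.zeros(3)
fisher, grad = fisher_and_gradient(X, X @ w0 - y)

# Full natural gradient: a unit step along the first direction.
d_full = natural_cg_direction(grad, natural_gradient(fisher, grad))
w_full = w0 + d_full

# Diagonal variant: still a descent direction, but not an exact solve.
d_diag = natural_cg_direction(grad, natural_gradient(fisher, grad, diagonal=True))
```

On this linear toy problem the loss is quadratic, so a unit step along the full natural gradient coincides with a Gauss–Newton step and recovers `w_true` up to the damping term. The diagonal variant only rescales each coordinate, which is what makes its cost comparable to plain CG. For an MLP the Jacobian would come from backpropagation, and a line search would replace the unit step.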