Just storing the Hessian $H$ (the matrix of second derivatives $\partial^2 E / \partial w_i \partial w_j$ of the error $E$ with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like $H$ is to compute its product with various vectors, we derive a technique that directly calculates $Hv$, where $v$ is an arbitrary vector. To calculate $Hv$, we first define a differential operator $\mathcal{R}_v\{f(w)\} = (\partial/\partial r)\, f(w + rv)\big|_{r=0}$, note that $\mathcal{R}_v\{\nabla_w\} = Hv$ and $\mathcal{R}_v\{w\} = v$, and then apply $\mathcal{R}_v\{\cdot\}$ to the equations used to compute $\nabla_w$. The result is an exact and numerically stable procedure for computing $Hv$, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to a one-pass gradient calculation algorithm (backpropagation), a relaxation gradient calculation algorithm (recurrent backpropagation), and two stochastic gradient calculation algorithms (Boltzmann machines and weight perturbation). Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of $H$, obviating any need to calculate the full Hessian.
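
As a minimal sketch (not the paper's original implementation), note that $\mathcal{R}_v\{\nabla_w E\}$ is a forward-mode directional derivative of a reverse-mode gradient, so modern automatic differentiation realizes the procedure directly. In JAX, with an illustrative scalar error `loss` (the loss and the variable names here are assumptions for demonstration only):

    import jax
    import jax.numpy as jnp

    def loss(w):
        # Illustrative scalar error E(w); any differentiable loss works.
        return jnp.sum(jnp.tanh(w) ** 2)

    def hvp(w, v):
        # R_v{grad E} = (d/dr) grad E(w + r v) |_{r=0} = H v:
        # a forward-mode jvp of the reverse-mode gradient, computed
        # exactly, at roughly the cost of one extra gradient pass.
        return jax.jvp(jax.grad(loss), (w,), (v,))[1]

    w = jnp.array([0.5, -1.0, 2.0])
    v = jnp.array([1.0, 0.0, -1.0])
    print(hvp(w, v))  # H v, without ever forming H

Because this yields $Hv$ without materializing $H$, it can feed iterative methods (e.g., conjugate gradient or the power method) that extract properties of $H$ from matrix-vector products alone, as the abstract notes.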