Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Authors:
Justin Werfel; Xiaohui Xie; H. Sebastian Seung
Affiliations:
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.; Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02141, U.S.A.; Howard Hughes Medical Institute, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Venue:
Neural Computation
Year:
2005

Abstract

Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are sometimes used to overcome these difficulties. We analyze three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. Learning speed is defined as the rate of exponential decay in the learning curves. When the scalar parameter that controls the size of weight updates is chosen to maximize learning speed, node perturbation is slower than direct gradient descent by a factor equal to the number of output units; weight perturbation is slower still by an additional factor equal to the number of input units. Parallel perturbation allows faster learning than sequential perturbation, by a factor that does not depend on network size. We also characterize how uncertainty in quantities used in the stochastic updates affects the learning curves. This study suggests that in practice, weight perturbation may be slow for large networks, and node perturbation can have performance comparable to that of direct gradient descent when there are few output units. However, these statements depend on the specifics of the learning problem, such as the input distribution and the target function, and are not universally applicable.
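
The three update rules named in the abstract can be illustrated with a minimal NumPy sketch. It assumes a linear perceptron y = W x with squared error E = 0.5 ||y - d||^2 and Gaussian perturbations of standard deviation sigma; the function names, the parameters eta and sigma, and the normalization by sigma^2 are illustrative choices, not taken from the paper itself. Under these assumptions, both stochastic rules average to the direct gradient step, which the helper mean_update can be used to check empirically.

```python
import numpy as np


def squared_error(W, x, d):
    """E = 0.5 * ||W x - d||^2 for a linear perceptron y = W x."""
    r = W @ x - d
    return 0.5 * float(r @ r)


def direct_gradient_update(W, x, d, eta):
    """Exact online gradient step: -eta * dE/dW = -eta * (y - d) x^T."""
    return -eta * np.outer(W @ x - d, x)


def node_perturbation_update(W, x, d, eta, sigma, rng):
    """Perturb the output units with noise xi and correlate the resulting
    change in error with xi (assumed form, one perturbation per update)."""
    xi = rng.normal(0.0, sigma, size=W.shape[0])
    y = W @ x
    e_clean = 0.5 * float((y - d) @ (y - d))
    e_noisy = 0.5 * float((y + xi - d) @ (y + xi - d))
    return -(eta / sigma**2) * (e_noisy - e_clean) * np.outer(xi, x)


def weight_perturbation_update(W, x, d, eta, sigma, rng):
    """Perturb every weight with noise psi and correlate the resulting
    change in error with psi (assumed form, all weights in parallel)."""
    psi = rng.normal(0.0, sigma, size=W.shape)
    e_clean = squared_error(W, x, d)
    e_noisy = squared_error(W + psi, x, d)
    return -(eta / sigma**2) * (e_noisy - e_clean) * psi


def mean_update(update_fn, n_samples, *args):
    """Average n_samples independent draws of a stochastic update."""
    return sum(update_fn(*args) for _ in range(n_samples)) / n_samples
```

Node perturbation draws one noise value per output unit, whereas weight perturbation draws one per weight. The weight perturbation estimate is therefore noisier for a given network, which is in line with the abstract's statement that it learns more slowly by a factor tied to the number of input units.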