The general inefficiency of batch training for gradient descent learning

  • Authors:
  • D. Randall Wilson; Tony R. Martinez

  • Affiliations:
  • Fonix Corporation, 180 West Election Road Suite 200, Draper, UT; Computer Science Department, 3361 TMCB, Brigham Young University, Provo, UT

  • Venue:
  • Neural Networks
  • Year:
  • 2003

Abstract

Gradient descent training of neural networks can be done in either a batch or on-line manner. A widely held myth in the neural network community is that batch training is as fast as, or faster than, on-line training, and/or more 'correct', because it supposedly uses a better approximation of the true gradient for its weight updates. This paper explains why batch training is almost always slower than on-line training, often by orders of magnitude, especially on large training sets. The main reason is that on-line training can follow curves in the error surface throughout each epoch, which allows it to safely use a larger learning rate and thus converge in fewer iterations through the training data. Empirical results on a large (20,000-instance) speech recognition task and on 26 other learning tasks demonstrate that convergence can be reached significantly faster using on-line training than batch training, with no apparent difference in accuracy.
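
The difference between the two update schemes can be sketched in a few lines. The example below is a minimal, hypothetical NumPy illustration, not the paper's speech-recognition experiments: the linear least-squares model, synthetic data, learning rate, and epoch count are all assumptions made for clarity. It shows batch training accumulating the gradient over the whole training set and applying one weight update per epoch, while on-line training updates the weights after every instance and so takes many small steps along the error surface within the same pass through the data.

    # Illustrative sketch of batch vs. on-line gradient descent updates.
    # Model, data, and hyperparameters are assumptions for demonstration only.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))              # 1000 training instances, 10 features
    true_w = rng.normal(size=10)
    y = X @ true_w + 0.1 * rng.normal(size=1000)  # noisy linear targets

    def batch_epoch(w, lr):
        """Accumulate the gradient over the whole training set, then update once."""
        grad = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            grad += (x_i @ w - y_i) * x_i        # gradient of 0.5 * (x.w - y)^2
        return w - lr * grad / len(X)

    def online_epoch(w, lr):
        """Update the weights after every training instance (on-line training)."""
        for x_i, y_i in zip(X, y):
            w = w - lr * (x_i @ w - y_i) * x_i
        return w

    w_batch = np.zeros(10)
    w_online = np.zeros(10)
    for _ in range(20):
        w_batch = batch_epoch(w_batch, lr=0.05)
        w_online = online_epoch(w_online, lr=0.05)

    def mse(w):
        return float(np.mean((X @ w - y) ** 2))

    print(f"batch MSE after 20 epochs:   {mse(w_batch):.4f}")
    print(f"on-line MSE after 20 epochs: {mse(w_online):.4f}")

With the same per-update learning rate, the on-line variant performs one weight update per instance (1000 per epoch here) instead of one per epoch, which is the intuition behind the abstract's claim that on-line training typically reaches convergence in far fewer passes through the training data.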