In on-line gradient descent learning, the local property of the derivative term of the output function can slow convergence. Improving this derivative term, for example by using the natural gradient, has been proposed as a way to speed up convergence. Besides such sophisticated methods, a "simple method" that replaces the derivative term with a constant has been proposed and shown to greatly increase convergence speed. Although this phenomenon has been analyzed empirically, theoretical analysis is required to establish its generality. In this paper, we theoretically analyze the effect of the simple method. Our results show that, with the simple method, the generalization error decreases faster than with true gradient descent when the learning step is smaller than the optimum value η_opt. When the learning step is larger than η_opt, the generalization error decreases more slowly with the simple method, and the residual error is larger than with true gradient descent. Moreover, when there is output noise, η_opt is no longer optimal; thus, the simple method is not robust in noisy circumstances.
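To make the two update rules concrete, the following is a minimal sketch in a teacher-student perceptron setting, assuming a tanh output function and illustrative values for the learning step eta, the constant c, and the input dimension N; none of these choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000        # input dimension (illustrative, not the paper's value)
eta = 0.1       # learning step; eta_opt from the analysis is not assumed here
c = 0.5         # constant replacing the derivative term in the simple method
steps = 20000

def g(h):
    """Output function of teacher and student (tanh chosen for illustration)."""
    return np.tanh(h)

def g_prime(h):
    """Derivative of the output function, used only by true gradient descent."""
    return 1.0 - np.tanh(h) ** 2

B = rng.standard_normal(N) / np.sqrt(N)   # teacher weight vector, |B| ~ 1
J_true = np.zeros(N)                      # student trained by true gradient descent
J_simple = np.zeros(N)                    # student trained by the simple method

for _ in range(steps):
    x = rng.standard_normal(N)            # fresh example each step (on-line learning)
    t = g(B @ x)                          # teacher output, noise-free in this sketch

    # True gradient descent on the squared error (t - s)^2 / 2:
    h = J_true @ x
    J_true += (eta / N) * (t - g(h)) * g_prime(h) * x

    # Simple method: the derivative g'(h) is replaced by the constant c:
    h = J_simple @ x
    J_simple += (eta / N) * (t - g(h)) * c * x

# Direction cosine with the teacher as a rough proxy for generalization:
for name, J in [("true gradient", J_true), ("simple method", J_simple)]:
    overlap = (J @ B) / (np.linalg.norm(J) * np.linalg.norm(B) + 1e-12)
    print(f"{name}: overlap with teacher = {overlap:.3f}")
```

The intuition visible in the sketch is that g'(h) vanishes for large |h|, so the true-gradient update can stall on plateaus, while the constant c keeps the update magnitude from shrinking; the trade-off described above is that this constant also prevents the update from vanishing near the solution, which enlarges the residual error when the learning step exceeds η_opt.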