Lessons in neural network training: overfitting may be harder than expected

Authors:
Steve Lawrence;C. Lee Giles;Ah Chung Tsoi
Affiliations:
NEC Research, Princeton, NJ;NEC Research, Princeton, NJ;Faculty of Informatics, University of Wollongong, Australia
Venue:
AAAI'97/IAAI'97 Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on Innovative applications of artificial intelligence
Year:
1997

Abstract

For many reasons, neural networks have become very popular machine learning models in AI. Two of the most important aspects of machine learning models are how well the model generalizes to unseen data and how well it scales with problem complexity. Using a controlled task with known optimal training error, we investigate the convergence of the backpropagation (BP) algorithm. We find that the optimal solution is typically not found. Furthermore, we observe that networks larger than might be expected can result in lower training and generalization error. This result is supported by another real-world example. We further investigate the training behavior by analyzing the weights in trained networks (excess degrees of freedom are seen to do little harm and to aid convergence) and by contrasting the interpolation characteristics of multi-layer perceptron neural networks (MLPs) and polynomial models (overfitting behavior is very different: the MLP is often biased towards smoother solutions). Finally, we analyze relevant theory, outlining the reasons for significant practical differences. These results bring into question common beliefs about neural network training regarding convergence and optimal network size, suggest alternate guidelines for practical use (less fear of excess degrees of freedom), and help to direct future work (e.g. methods for creating more parsimonious solutions, the importance of the MLP/BP bias, and possibly worse performance of "improved" training algorithms).
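The "controlled task with known optimal training error" mentioned in the abstract can be illustrated with a teacher-student setup: a fixed "teacher" MLP generates the targets, so a "student" with at least as many hidden units could in principle reach zero training error. The following is a minimal sketch under stated assumptions, not the authors' experimental code. The network sizes, learning rate, epoch count, initialization scales, and the function names `init_mlp`, `forward`, `mse`, `train`, and `run_experiment` are all hypothetical choices made for illustration.

```python
import numpy as np


def init_mlp(rng, n_in, n_hidden, n_out, scale):
    """Random one-hidden-layer MLP (tanh hidden units, linear output)."""
    return {
        "W1": rng.normal(0.0, scale, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, scale, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(p, X):
    """Return hidden activations and network outputs for inputs X."""
    H = np.tanh(X @ p["W1"] + p["b1"])
    return H, H @ p["W2"] + p["b2"]


def mse(p, X, Y):
    """Mean squared error of network p on (X, Y)."""
    return float(np.mean((forward(p, X)[1] - Y) ** 2))


def train(p, X, Y, lr=0.01, epochs=1000):
    """Full-batch gradient descent with backpropagated MSE gradients."""
    n = X.shape[0]
    for _ in range(epochs):
        H, out = forward(p, X)
        # Gradient of the mean squared error with respect to the outputs.
        d_out = 2.0 * (out - Y) / (n * Y.shape[1])
        # Backpropagate through the linear output layer and the tanh units.
        d_hid = (d_out @ p["W2"].T) * (1.0 - H ** 2)
        p["W2"] -= lr * (H.T @ d_out)
        p["b2"] -= lr * d_out.sum(axis=0)
        p["W1"] -= lr * (X.T @ d_hid)
        p["b1"] -= lr * d_hid.sum(axis=0)
    return p


def run_experiment(seed=0, n_in=5, teacher_hidden=4,
                   student_sizes=(2, 4, 16), n_train=200, n_test=1000):
    """Train students of several sizes on data produced by one teacher.

    Returns {hidden_units: (initial_train_mse, final_train_mse, test_mse)}.
    The optimal training error is zero for any student with at least
    `teacher_hidden` hidden units, because the teacher itself is such a net.
    """
    rng = np.random.default_rng(seed)
    teacher = init_mlp(rng, n_in, teacher_hidden, 1, scale=1.0)
    X_train = rng.normal(size=(n_train, n_in))
    X_test = rng.normal(size=(n_test, n_in))
    Y_train = forward(teacher, X_train)[1]
    Y_test = forward(teacher, X_test)[1]

    results = {}
    for h in student_sizes:
        student = init_mlp(rng, n_in, h, 1, scale=0.5)
        start = mse(student, X_train, Y_train)
        train(student, X_train, Y_train)
        results[h] = (start,
                      mse(student, X_train, Y_train),
                      mse(student, X_test, Y_test))
    return results


if __name__ == "__main__":
    for h, (start, train_err, test_err) in run_experiment().items():
        print(f"hidden={h:3d}  start={start:.4f}  "
              f"train={train_err:.4f}  test={test_err:.4f}")
```

Comparing the final training error of each student against the known optimum of zero shows how far plain gradient descent falls short for a given size. Which size does best depends on the seed and hyperparameters, so this sketch should not be read as reproducing the paper's results.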