Presenting and Analyzing the Results of AI Experiments: Data Averaging and Data Snooping

  • Authors:
  • C. Lee Giles; Steve Lawrence

  • Affiliations:
  • NEC Research Institute, Princeton, NJ; NEC Research Institute, Princeton, NJ

  • Venue:
  • AAAI'97/IAAI'97: Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence
  • Year:
  • 1997

Abstract

Experimental results reported in the machine learning and AI literature can be misleading. This paper investigates the common practices of data averaging (reporting results as the mean and standard deviation over multiple trials) and data snooping in the context of neural networks, one of the most popular machine learning models in AI. Both practices can produce misleading results and inaccurate conclusions. We demonstrate how easily this can happen and propose techniques for avoiding these problems. For data averaging, the common presentation assumes that the distribution of individual results is Gaussian. However, we investigate the distribution for common problems and find that it often does not approximate a Gaussian: it may be asymmetric and may be multimodal. We show that assuming Gaussian distributions can significantly affect the interpretation of results, especially in comparison studies. For a controlled task, we find that the distribution of performance is skewed towards better performance for smoother target functions and towards worse performance for more complex target functions. We propose new guidelines for reporting performance which convey more information about the actual distribution (e.g., box-and-whisker plots). For data snooping, we demonstrate that optimizing performance by experimenting with multiple parameters can lead to significance being assigned to results that are due to chance. We suggest that precise descriptions of experimental technique are very important to the evaluation of results, and that potential data snooping biases must be kept in mind when formulating these techniques (e.g., when selecting the test procedure). Additionally, it is important to rely only on appropriate statistical tests and to ensure that any assumptions made in the tests are valid (e.g., normality of the distribution).
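The recommendation to report the shape of the result distribution, rather than only a mean and standard deviation, can be illustrated with a short sketch. The code below is not the authors' code; the data, the library choices (NumPy and Matplotlib), and the skewed error distribution are assumptions used purely to show how quartiles and a box-and-whisker plot expose features that a mean and standard deviation hide.

```python
# Illustrative sketch only: summarising multi-trial results with
# distribution-aware statistics rather than mean +/- std alone.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical test-set error rates from 50 independent training trials;
# a lognormal draw stands in for the skewed, non-Gaussian distributions
# the paper observes in practice.
rng = np.random.default_rng(0)
errors = rng.lognormal(mean=-2.0, sigma=0.5, size=50)

# Mean and standard deviation alone can hide skew and multimodality.
print(f"mean = {errors.mean():.4f}, std = {errors.std(ddof=1):.4f}")

# Quartiles and extremes convey the shape of the distribution.
q1, median, q3 = np.percentile(errors, [25, 50, 75])
print(f"min = {errors.min():.4f}, Q1 = {q1:.4f}, median = {median:.4f}, "
      f"Q3 = {q3:.4f}, max = {errors.max():.4f}")

# Box-and-whisker plot of the trial results.
plt.boxplot(errors)
plt.ylabel("test error")
plt.savefig("trial_distribution.png")
```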
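Similarly, the data snooping effect described above (apparent significance arising purely from tuning many parameters against the same test set) can be demonstrated with a small Monte Carlo sketch. This is not the paper's experiment; the test-set size, the number of configurations, and the use of SciPy's binomial test are assumptions chosen only to make the selection bias visible.

```python
# Illustrative sketch only: selecting the best of many configurations
# evaluated on the same test set inflates apparent performance.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)
n_test = 200         # size of a hypothetical test set
n_configs = 50       # number of parameter settings tried
chance = 0.5         # every configuration is really guessing

# Simulated test-set accuracy for each configuration: pure chance variation.
scores = rng.binomial(n_test, chance, size=n_configs) / n_test
best = scores.max()
print(f"best of {n_configs} configurations: {best:.3f} accuracy")
print(f"a single configuration, for comparison: {scores[0]:.3f}")

# A naive one-sided binomial test on the selected best score ignores the
# selection over n_configs trials and will often report p < 0.05 even
# though nothing real was learned.
p_naive = binomtest(int(round(best * n_test)), n_test, p=chance,
                    alternative="greater").pvalue
print(f"naive p-value for the selected best configuration: {p_naive:.4f}")
```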