Correcting bias in statistical tests for network classifier evaluation

Authors:
Tao Wang; Jennifer Neville; Brian Gallagher; Tina Eliassi-Rad
Affiliations:
Department of Computer Science, Purdue University, West Lafayette, IN; Department of Computer Science and Statistics, Purdue University, West Lafayette, IN; Lawrence Livermore National Laboratory, Livermore, CA; Department of Computer Science, Rutgers University, Piscataway, NJ
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Year:
2011

Citing 5

Approximate statistical tests for comparing supervised classification learning algorithms
Neural Computation

Inference for the Generalization Error
Machine Learning

No Unbiased Estimator of the Variance of K-Fold Cross-Validation
The Journal of Machine Learning Research

Classification in Networked Data: A Toolkit and a Univariate Case Study
The Journal of Machine Learning Research

Correcting evaluation bias of relational classifiers with network cross validation
Knowledge and Information Systems

Cited 1

Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management

Abstract

Conventional significance tests are difficult to apply directly when comparing the performance of network classification models, because network data instances are not independent and identically distributed. Recent work [6] has shown that paired t-tests applied to overlapping network samples result in unacceptably high Type I error rates (e.g., up to 50%); that is, the tests incorrectly conclude that models differ when they do not. New strategies are therefore needed to evaluate network classifiers accurately. In this paper, we analyze the sources of bias (e.g., dependencies among network data instances) theoretically and propose analytical corrections to standard significance tests that reduce the Type I error rate to more acceptable levels while maintaining reasonable statistical power to detect true performance differences. We validate the effectiveness of the proposed corrections empirically on both synthetic and real networks.
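
The Type I error inflation described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental setup or their proposed correction. It is a toy model, under assumptions of our own, in which two hypothetical classifiers have identical true accuracy and their per-instance correctness is drawn independently, so it captures only the effect of overlapping test samples and not network dependencies. The function names (`paired_t_rejects`, `simulate`) and all parameter values (pool size, sample size, number of samples, accuracy) are illustrative choices.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 9 degrees of freedom,
# i.e. for a paired t-test over 10 differences.
T_CRIT_DF9 = 2.262


def paired_t_rejects(diffs, t_crit=T_CRIT_DF9):
    """Return True if a paired t-test on the differences rejects H0: mean = 0.

    The default critical value assumes len(diffs) == 10.
    """
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        return False  # degenerate case with no variation: treat as no rejection
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return abs(t) > t_crit


def simulate(overlapping, trials=500, pool_size=200, sample_size=100,
             folds=10, acc=0.7, seed=0):
    """Estimate the Type I error rate when both classifiers have accuracy `acc`.

    overlapping=True : every test sample is drawn (without replacement) from
                       one fixed pool, so the samples share instances.
    overlapping=False: every test sample consists of freshly generated instances.
    """
    rng = random.Random(seed)

    def one_instance():
        # (classifier A correct) - (classifier B correct), a value in {-1, 0, 1}
        return (rng.random() < acc) - (rng.random() < acc)

    rejections = 0
    for _ in range(trials):
        pool = [one_instance() for _ in range(pool_size)] if overlapping else None
        diffs = []
        for _ in range(folds):
            if overlapping:
                sample = rng.sample(pool, sample_size)
            else:
                sample = [one_instance() for _ in range(sample_size)]
            # Accuracy difference between the two classifiers on this sample
            diffs.append(sum(sample) / sample_size)
        rejections += paired_t_rejects(diffs)
    return rejections / trials


if __name__ == "__main__":
    print("overlapping samples:", simulate(overlapping=True))
    print("independent samples:", simulate(overlapping=False))
```

In this toy setting the rejection rate for overlapping samples should come out far above the nominal 5% level, while the rate for independent samples should stay close to it. The reason is that all overlapping samples reflect the same chance difference present in the shared pool, so the variance across samples understates the true variance of the estimate.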