Multivariate statistical tests for comparing classification algorithms

  • Authors:
  • Olcay Taner Yıldız; Özlem Aslan; Ethem Alpaydın

  • Affiliations:
  • Dept. of Computer Engineering, Işık University, Istanbul, Turkey; Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey; Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey

  • Venue:
  • LION'05: Proceedings of the 5th International Conference on Learning and Intelligent Optimization
  • Year:
  • 2011

Abstract

The misclassification error, which is usually used in tests to compare classification algorithms, does not distinguish between the sources of error, namely false positives and false negatives. Instead of summing these into a single number, we propose to collect multivariate statistics and use multivariate tests on them. Information retrieval uses the measures of precision and recall, and signal detection uses the true positive rate (tpr) and false positive rate (fpr); a multivariate test can likewise use two such values instead of combining them into a single value, such as error or average precision. For example, we can have bivariate tests for (precision, recall) or (tpr, fpr). We propose to use the pairwise test based on Hotelling's multivariate T² test to compare two algorithms, or multivariate analysis of variance (MANOVA) to compare L > 2 algorithms. In our experiments, we show that the multivariate tests have higher power than the univariate error test, that is, they can detect differences that the error test cannot, and we also discuss how the decisions made by different multivariate tests differ, to point out where to use which. We also show how multivariate or univariate pairwise tests can be used as post-hoc tests after MANOVA to find cliques of algorithms, or to order them along separate dimensions.
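
To make the proposal concrete, the following is a minimal sketch (not the authors' code) of a paired test based on Hotelling's T² statistic for comparing two classifiers on bivariate per-fold statistics such as (tpr, fpr). The function name, the synthetic data, and the use of NumPy/SciPy are illustrative assumptions; the statistic follows the standard one-sample T² test applied to the per-fold difference vectors.

    # Paired Hotelling's T^2 test on bivariate per-fold statistics (illustrative sketch).
    import numpy as np
    from scipy import stats

    def hotelling_t2_paired(stats_a, stats_b, alpha=0.05):
        """stats_a, stats_b: (k, p) arrays of per-fold statistics,
        e.g. columns (tpr, fpr) over k cross-validation folds.
        Tests H0: the mean paired difference vector is zero."""
        d = np.asarray(stats_a, float) - np.asarray(stats_b, float)  # paired differences
        k, p = d.shape
        m = d.mean(axis=0)                      # mean difference vector
        S = np.cov(d, rowvar=False)             # sample covariance of the differences
        t2 = k * m @ np.linalg.solve(S, m)      # Hotelling's T^2 = k * m' S^{-1} m
        f_stat = (k - p) / (p * (k - 1)) * t2   # T^2 -> F transformation
        p_value = stats.f.sf(f_stat, p, k - p)  # compare with F(p, k - p) under H0
        return t2, p_value, p_value < alpha

    # Hypothetical example: (tpr, fpr) of two algorithms over 10 folds.
    rng = np.random.default_rng(0)
    a = np.column_stack([rng.normal(0.90, 0.02, 10), rng.normal(0.10, 0.02, 10)])
    b = np.column_stack([rng.normal(0.85, 0.02, 10), rng.normal(0.15, 0.02, 10)])
    t2, p_value, reject = hotelling_t2_paired(a, b)
    print(f"T^2 = {t2:.2f}, p = {p_value:.4f}, reject H0: {reject}")

For L > 2 algorithms, the corresponding omnibus test is MANOVA (available, for example, via statsmodels.multivariate.manova.MANOVA), after which pairwise tests such as the one above can serve as post-hoc tests, as described in the abstract.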