Indexes for three-class classification performance assessment: an empirical comparison

  • Authors:
  • Mehul P. Sampat; Amit C. Patel; Yuhling Wang; Shalini Gupta; Chih-Wen Kan; Alan C. Bovik; Mia K. Markey

  • Affiliations:
  • Department of Radiology, Center for Neurological Imaging, Brigham and Women's Hospital, Boston, MA; University of Texas Southwestern, Dallas, TX; Department of Biomedical Engineering, Charlottesville, VA; Department of Electrical and Computer Engineering, University of Texas, Austin, TX; Department of Biomedical Engineering, University of Texas, Austin, TX; Department of Electrical and Computer Engineering, University of Texas, Austin, TX; Department of Biomedical Engineering, University of Texas, Austin, TX

  • Venue:
  • IEEE Transactions on Information Technology in Biomedicine
  • Year:
  • 2009


Abstract

Assessment of classifier performance is critical for the fair comparison of methods, including the consideration of alternative models or parameters during system design. The assessment must not only provide meaningful data on classifier efficacy, but it must do so in a concise and clear manner. For two-class classification problems, receiver operating characteristic (ROC) analysis provides a clear and concise assessment methodology for reporting performance and comparing competing systems. However, many important biomedical questions cannot be posed as two-class classification tasks, and more than two classes are often necessary. While several methods have been proposed for assessing the performance of classifiers on such multiclass problems, none has been widely accepted. The purpose of this paper is to critically review methods that have been proposed for assessing multiclass classifiers. A number of these methods provide a classifier performance index called the volume under the ROC surface (VUS). Empirical comparisons are carried out using four three-class case studies, in which three popular classification techniques are evaluated with these methods. Since the same classifier was assessed using multiple performance indexes, it is possible to gain insight into the relative strengths and weaknesses of the measures. We conclude that: 1) the method proposed by Scurfield provides the most detailed description of classifier performance and insight into the sources of error in a given classification task, and 2) the methods proposed by He and Nakas also have great practical utility, as they provide both the VUS and an estimate of the variance of the VUS. These estimates can be used to statistically compare two classification algorithms.
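
For intuition about the VUS index the abstract refers to: in the three-class setting it is commonly estimated nonparametrically as the fraction of score triplets (one sample per class) that the classifier orders correctly, generalizing the two-class AUC. Below is a minimal Python sketch of such an estimator; the function name, variable names, and synthetic data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a nonparametric VUS estimator for a three-class
# problem: the fraction of correctly ordered score triplets.
# All names and the synthetic data here are illustrative assumptions.
import numpy as np

def vus_estimate(s1, s2, s3):
    """Estimate the volume under the ROC surface (VUS).

    s1, s2, s3: 1-D arrays of decision scores for samples from
    classes 1, 2, and 3, assuming the classifier is meant to score
    class 1 lowest and class 3 highest. Returns the fraction of
    triplets (x in s1, y in s2, z in s3) with x < y < z; 1/6 is
    chance performance and 1.0 is perfect separation.
    """
    s1, s2, s3 = map(np.asarray, (s1, s2, s3))
    # Broadcast pairwise comparisons to a (n1, n2, n3) grid of triplets.
    correct = (s1[:, None, None] < s2[None, :, None]) & \
              (s2[None, :, None] < s3[None, None, :])
    return correct.mean()

# Illustrative use with synthetic Gaussian scores (assumed data).
rng = np.random.default_rng(0)
scores = [rng.normal(mu, 1.0, size=50) for mu in (0.0, 1.0, 2.0)]
print(f"VUS estimate: {vus_estimate(*scores):.3f}")  # chance level is 1/6
```

The brute-force triple loop implied by the broadcasting above is O(n1·n2·n3), which is fine for small case studies; variance estimates of the kind provided by the He and Nakas methods would be layered on top of such an estimator (e.g., via U-statistic theory or resampling).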