Feature Selection for Maximizing the Area Under the ROC Curve

  • Authors:
  • Rui Wang; Ke Tang

  • Venue:
  • ICDMW '09: Proceedings of the 2009 IEEE International Conference on Data Mining Workshops
  • Year:
  • 2009

Abstract

Feature selection is an important pre-processing step for solving classification problems. A good feature selection method may not only improve the performance of the final classifier, but also reduce its computational complexity. Traditionally, feature selection methods were developed to maximize the classification accuracy of a classifier. Recently, both theoretical and experimental studies have revealed that the classifier with the highest accuracy might not be ideal for real-world problems. Instead, the Area Under the ROC Curve (AUC) has been suggested as an alternative metric, and many existing learning algorithms have been modified to seek the classifier with maximum AUC. However, little work has been done to develop new feature selection methods suited to the requirement of AUC maximization. To fill this gap in the literature, we propose a novel algorithm, called the AUC and Rank Correlation coefficient Optimization (ARCO) algorithm. ARCO adopts the general framework of a well-known method, the minimal-redundancy-maximal-relevance (mRMR) criterion, but defines the terms "relevance" and "redundancy" in entirely different ways. Such a modification looks trivial from the perspective of algorithmic design. Nevertheless, an experimental study on four gene expression data sets showed that the feature subsets obtained by ARCO resulted in classifiers with significantly larger AUC than the feature subsets obtained by mRMR. Moreover, ARCO also outperformed the Feature Assessment by Sliding Thresholds algorithm, which was recently proposed for AUC maximization, thus validating the efficacy of ARCO.
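
The abstract gives no pseudocode, but its description suggests a greedy, mRMR-style search in which relevance is measured by a feature's individual AUC and redundancy by its rank correlation with already selected features. The sketch below is a minimal illustration of that reading; the function name arco_select, the use of Spearman's rho, and the difference-form score are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score


def arco_select(X, y, k):
    """Greedy mRMR-style selection: AUC as relevance, rank correlation
    as redundancy. Illustrative sketch, not the authors' reference code."""
    n_features = X.shape[1]

    # Relevance: AUC obtained when a single feature is used as a ranking score.
    # Taking the max over both orderings makes the measure orientation-free.
    relevance = np.empty(n_features)
    for j in range(n_features):
        auc = roc_auc_score(y, X[:, j])
        relevance[j] = max(auc, 1.0 - auc)

    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]

    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean |Spearman rho| against already selected features.
            rhos = []
            for s in selected:
                rho, _ = spearmanr(X[:, j], X[:, s])
                rhos.append(abs(rho))
            redundancy = float(np.mean(rhos))
            # mRMR-style difference criterion: relevance minus redundancy.
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)

    return selected
```

Under these assumptions, calling arco_select(X, y, k=20) on a gene expression matrix X with binary labels y would return the indices of 20 features chosen to balance individual AUC against mutual rank correlation.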