On understanding and assessing feature selection bias

  • Authors:
  • Šarunas Raudys;Richard Baumgartner;Ray Somorjai

  • Affiliations:
  • Vilnius Gediminas Technical University, Vilnius, Lithuania;Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada;Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada

  • Venue:
  • AIME'05 Proceedings of the 10th conference on Artificial Intelligence in Medicine
  • Year:
  • 2005

Abstract

Feature selection in high-dimensional biomedical data, such as gene expression arrays or biomedical spectra, constitutes an important step towards biomarker discovery. Controlling feature selection bias is considered a major issue for a realistic assessment of the feature selection process. We propose a theoretical, probabilistic framework for the analysis of selection bias. In particular, we derive the means of calculating the true selection error when the performance estimates of the feature subsets are mutually dependent and the distribution density of the true error is arbitrary. In an extensive series of experiments with real-world datasets, we demonstrate the utility of the theoretical derivations. We discuss the importance of understanding feature selection bias for the small sample size (n) / high dimensionality (p) situation, typical for biomedical data (genomics, proteomics, spectroscopy).
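The selection bias discussed in the abstract can be illustrated with a small simulation. The sketch below is not the authors' probabilistic framework; it is a minimal, assumed setup showing the phenomenon on pure-noise data, where the true error of any classifier is 0.5. The function names, the mean-difference ranking criterion, the nearest-mean classifier, and the values n = 20, p = 1000, k = 10 are all illustrative choices and do not come from the paper.

```python
import numpy as np

def select_top_k(X, y, k):
    """Rank features by absolute difference of class means; return the top-k indices."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_mean_predict(X_train, y_train, x):
    """Assign x to the class whose training mean is closer (squared Euclidean distance)."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    return int(np.sum((x - m1) ** 2) < np.sum((x - m0) ** 2))

def loo_error(X, y, k, select_inside_fold):
    """Leave-one-out error estimate with feature selection done outside or inside the loop."""
    n = len(y)
    # Selection on all samples: the held-out sample leaks into the selection step.
    global_idx = select_top_k(X, y, k)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        if select_inside_fold:
            # Selection repeated on the training part only: no leakage.
            idx = select_top_k(X[train], y[train], k)
        else:
            idx = global_idx
        pred = nearest_mean_predict(X[train][:, idx], y[train], X[i, idx])
        errors += int(pred != y[i])
    return errors / n

def simulate(n=20, p=1000, k=10, repeats=20, seed=0):
    """Average both estimates over several pure-noise datasets (small n, large p)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)  # balanced labels, independent of the features
    biased, honest = [], []
    for _ in range(repeats):
        X = rng.standard_normal((n, p))  # features carry no class information
        biased.append(loo_error(X, y, k, select_inside_fold=False))
        honest.append(loo_error(X, y, k, select_inside_fold=True))
    return float(np.mean(biased)), float(np.mean(honest))

biased_err, honest_err = simulate()
print(f"selection outside CV: {biased_err:.3f}")
print(f"selection inside CV:  {honest_err:.3f}")
```

Because the features are independent of the labels, an estimate well below 0.5 can only come from the selection step having seen the held-out sample. Repeating the selection inside every fold removes that leakage, at the cost of one extra selection pass per fold.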