Robust and Accurate Cancer Classification with Gene Expression Profiling

  • Authors:
  • Haifeng Li; Keshu Zhang; Tao Jiang

  • Affiliations:
  • University of California at Riverside; Motorola, Inc.; University of California at Riverside

  • Venue:
  • CSB '05 Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference
  • Year:
  • 2005

Abstract

Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended ratio of samples to features per class. As a result, it can be used to classify new samples robustly with low and reliable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S_w be nonsingular. Unfortunately, S_w is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimizing Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S_w is nonsingular. Unlike the conventional LDA, GLDA does not assume the nonsingularity of S_w, and thus naturally solves the small sample size problem. To accommodate the high dimensionality of the scatter matrices, a fast algorithm for GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. In particular, on some difficult instances with very small ratios of samples to genes per class, our method achieves much higher accuracies than widely used classification methods such as support vector machines and random forests.
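The paper's GLDA algorithm itself is not reproduced here. As a rough, hedged sketch of the general idea the abstract describes (optimizing Fisher's criterion for dimensionality reduction when the within-class scatter matrix S_w is singular), the snippet below uses a pseudo-inverse-based variant of Fisher discriminant analysis. The function name `fisher_directions` and the optional ridge parameter `reg` are illustrative assumptions, not part of the paper, and this naive formulation does not include the paper's fast algorithm for very high-dimensional scatter matrices.

```python
import numpy as np

def fisher_directions(X, y, n_components=None, reg=0.0):
    """Illustrative sketch: find projection directions that maximize
    between-class scatter relative to within-class scatter.

    A Moore-Penrose pseudo-inverse is used in place of the ordinary
    inverse of S_w required by conventional LDA, so the computation
    does not break down when S_w is singular (the small-sample-size
    case typical of gene expression data). This is one common
    workaround, not necessarily the GLDA solution of the paper.
    """
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_total = X.mean(axis=0)

    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_total).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)

    # Pseudo-inverse (plus an optional ridge term `reg`) replaces S_w^{-1}.
    S_w_pinv = np.linalg.pinv(S_w + reg * np.eye(n_features))

    # Leading eigenvectors of S_w^+ S_b give the discriminant directions.
    eigvals, eigvecs = np.linalg.eig(S_w_pinv @ S_b)
    order = np.argsort(-eigvals.real)
    k = n_components if n_components is not None else len(classes) - 1
    return eigvecs[:, order[:k]].real  # projection matrix (n_features x k)

# Usage sketch: project expression profiles into the low-dimensional
# space, then classify there (e.g., nearest centroid or a linear model).
# W = fisher_directions(X_train, y_train)
# X_train_low, X_test_low = X_train @ W, X_test @ W
```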