Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets

  • Authors:
  • Ivan G. Costa;Ana C. Lorena;Liciana R. Peres;Marcilio C. Souto

  • Affiliations:
  • Center of Informatics, Federal University of Pernambuco, Recife, Brazil;Center of Mathematics, Computation and Cognition, ABC Fed. Univ., Brazil;Center of Mathematics, Computation and Cognition, ABC Fed. Univ., Brazil;Dept. of Informatics and Applied Mathematics, Fed. Univ. of Rio Grande do Norte,

  • Venue:
  • BSB '09 Proceedings of the 4th Brazilian Symposium on Bioinformatics: Advances in Bioinformatics and Computational Biology
  • Year:
  • 2009

Abstract

Supervised machine learning methods have been successfully applied to gene expression-based cancer diagnosis. Characteristics intrinsic to cancer gene expression data sets, such as high dimensionality, a low number of samples, and the presence of noise, make the classification task very difficult. Furthermore, limitations in classifier performance may often be attributed to characteristics intrinsic to a particular data set. This paper presents an analysis of gene expression data sets for cancer diagnosis using classification complexity measures. Such measures consider data geometry, distribution, and linear separability as indicators of the complexity of the classification task. The results indicate that the cancer data sets investigated consist mostly of linearly separable, non-overlapping classes, which supports the good predictive performance of robust linear classifiers, such as SVMs, on these data sets. Furthermore, we identified two complexity indices that were good indicators of the difficulty of gene expression-based cancer diagnosis.
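
The abstract does not name the complexity measures used, so the following is only an illustrative sketch of the general idea. It computes one widely cited measure from the classification complexity literature, the maximum Fisher's discriminant ratio over features, for a two-class problem. The function name `fisher_ratio`, the toy expression values, and the choice of population variance are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Maximum Fisher's discriminant ratio over all features (two classes).

    class_a, class_b: lists of samples; each sample is a list of feature
    values (e.g. expression levels), with the same feature order in both.
    Higher values suggest at least one feature separates the classes well.
    """
    best = 0.0
    # zip(*samples) transposes samples into per-feature columns.
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        gap = (mean(col_a) - mean(col_b)) ** 2
        # Assumption: population variance; a sample variance would also work.
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            # Constant feature in both classes: perfectly separating
            # if the means differ, uninformative otherwise.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical toy data: 3 samples per class, 2 genes per sample.
tumor = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
normal = [[1.0, 1.2], [2.0, 1.8], [3.0, 1.4]]
score = fisher_ratio(tumor, normal)
print(f"max Fisher ratio: {score:.2f}")
```

In this toy data the first gene separates the two classes cleanly while the second overlaps heavily, so the score is driven by the first gene. A large value is consistent with, but does not prove, linear separability of the classes.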