The Impact of Gene Selection on Imbalanced Microarray Expression Data

  • Authors:
  • Abu H. Kamal; Xingquan Zhu; Abhijit S. Pandya; Sam Hsu; Muhammad Shoaib

  • Affiliations:
  • Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA (all authors)

  • Venue:
  • BICoB '09 Proceedings of the 1st International Conference on Bioinformatics and Computational Biology
  • Year:
  • 2009

Abstract

Microarray experiments usually produce data with few samples but very high dimensionality. Selecting the genes relevant to the task at hand is therefore one of the most important steps in expression data analysis. While numerous studies have demonstrated the effectiveness of gene selection from different perspectives, existing work unfortunately ignores the reality of data imbalance, where samples of one type (e.g., cancer tissues) may be significantly fewer than those of the other (e.g., normal tissues). In this paper, we carry out a systematic study of the impact of gene selection on imbalanced microarray data. Our objective is to understand what consequences gene selection has for the final results when it is applied to imbalanced expression data. For this purpose, we apply five gene selection measures to eleven microarray datasets and employ four learning methods to build classification models from data containing only the selected genes. The study draws conclusions on (1) the impact of gene selection on imbalanced data and (2) the behavior of different learning methods on the selected data.
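
The experimental setup described in the abstract (rank genes, keep the top few, train a classifier on the selected genes, evaluate on imbalanced data) can be illustrated with a minimal sketch. The abstract does not name the five selection measures, the four learning methods, or the eleven datasets, so everything here is an assumption made for illustration: the data are synthetic, the signal-to-noise score stands in for a gene selection measure, and a nearest-centroid rule stands in for a learning method. Per-class recall is reported instead of overall accuracy because accuracy can look good on imbalanced data even when the minority class is misclassified.

```python
import random
import statistics


def make_imbalanced_data(n_major=40, n_minor=8, n_genes=200, n_informative=5, seed=0):
    """Synthetic expression matrix (assumption, not the paper's data).
    Rows are samples, columns are genes. Label 1 is the minority class;
    only the first n_informative genes differ between the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((0, n_major), (1, n_minor)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(n_informative):
                    row[g] += 3.0
            X.append(row)
            y.append(label)
    return X, y


def stratified_split(X, y, test_every=4):
    """Send every test_every-th sample of each class to the test set."""
    train, test, seen = [], [], {}
    for row, lab in zip(X, y):
        seen[lab] = seen.get(lab, 0) + 1
        (test if seen[lab] % test_every == 0 else train).append((row, lab))
    return train, test


def signal_to_noise(X, y, gene):
    """|mean0 - mean1| / (std0 + std1) for one gene (stand-in selection measure)."""
    a = [row[gene] for row, lab in zip(X, y) if lab == 0]
    b = [row[gene] for row, lab in zip(X, y) if lab == 1]
    spread = statistics.stdev(a) + statistics.stdev(b) + 1e-12
    return abs(statistics.mean(a) - statistics.mean(b)) / spread


def select_genes(X, y, k):
    """Return the indices of the k highest-scoring genes."""
    scores = [(signal_to_noise(X, y, g), g) for g in range(len(X[0]))]
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X_train, y_train, X_test, genes):
    """Assign each test sample to the class with the closest centroid,
    using only the selected genes (stand-in learning method)."""
    centroids = {}
    for lab in set(y_train):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroids[lab] = [statistics.mean(r[g] for r in rows) for g in genes]
    preds = []
    for row in X_test:
        dist = {lab: sum((row[g] - c) ** 2 for g, c in zip(genes, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def per_class_recall(y_true, y_pred):
    """Fraction of each class that was predicted correctly."""
    recall = {}
    for lab in set(y_true):
        idx = [i for i, l in enumerate(y_true) if l == lab]
        recall[lab] = sum(y_pred[i] == lab for i in idx) / len(idx)
    return recall


def run_demo(k=10):
    X, y = make_imbalanced_data()
    train, test = stratified_split(X, y)
    X_tr, y_tr = [r for r, _ in train], [l for _, l in train]
    X_te, y_te = [r for r, _ in test], [l for _, l in test]
    genes = select_genes(X_tr, y_tr, k)  # selection uses training data only
    preds = nearest_centroid_predict(X_tr, y_tr, X_te, genes)
    return genes, per_class_recall(y_te, preds)


if __name__ == "__main__":
    selected, recall = run_demo()
    print("selected genes:", selected)
    print("per-class recall:", recall)
```

Genes are selected on the training split only, so the test samples do not influence which genes are kept. Varying `n_minor` or `k` in this sketch gives a rough feel for how class imbalance and the number of selected genes interact, though it makes no claim about the paper's actual results.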