Cancer class prediction: Two stage clustering approach to identify informative genes

  • Authors:
  • Mohammed Alshalalfah; Reda Alhajj

  • Affiliations:
  • Department of Computer Science, University of Calgary, Calgary, Alberta, Canada. E-mail: {msalshal,alhajj}@ucalgary.ca; Department of Computer Science, University of Calgary, Calgary, Alberta, Canada. E-mail: {msalshal,alhajj}@ucalgary.ca and Department of Computer Science, Global University, Beirut, Lebanon

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2009

Abstract

Cancer classification is an important research area that has attracted the attention of several research groups over recent decades. However, there is still no generally agreed-upon approach for assigning tumors to known classes (a.k.a. class prediction). One challenge in microarray analysis, especially for cancerous gene expression profiles, is to identify genes or groups of genes that are highly expressed in tumor cells but not in normal cells, and vice versa. All of the methods described in the literature deal with features obtained directly from the data. Further, several clustering techniques have been proposed for the analysis of genome expression data, such as k-means and self-organizing maps. However, these methods do not provide information about the influence of a given gene on the overall shape of the clusters. In this paper, we try to generate informative data, which can be more powerful in the classification of genes. We identify a set of reduced features capable of distinguishing between two classes through two-stage clustering of genes using fuzzy c-means. In the first stage, the proposed clustering method clusters the original data. In the second stage, it clusters the genes in each of the clusters produced by the first stage. We chose fuzzy c-means because a fuzzy model better fits gene expression data analysis: a gene can belong to different classes, with a degree of membership per class. However, choosing the fuzziness parameter m is a major problem in applying fuzzy c-means for clustering. In this approach, we try to better identify the value of the fuzziness parameter when applying fuzzy c-means to microarray data. Support vector machines combined with different kernel functions are used for classification. The results of experiments conducted on three benchmark data sets (including one multi-class data set) demonstrate the applicability and effectiveness of the proposed approach compared to the other approaches described in the literature.
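The two-stage idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how first-stage clusters are formed from the fuzzy memberships, how many clusters are used, or how the reduced features are derived. The sketch assumes that each gene is assigned to its highest-membership cluster, that the second-stage cluster centers serve as the reduced features, and that m = 2.0 is only a placeholder for the fuzziness parameter. The function names `fuzzy_c_means` and `two_stage_features` and the toy data are hypothetical.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means. Returns (centers, memberships).

    points: list of equal-length numeric vectors (here, one vector per gene).
    m: fuzziness parameter (> 1); 2.0 is a placeholder, not a recommendation.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumption: initialize the centers from c randomly chosen points.
    centers = [list(p) for p in rng.sample(points, c)]
    u = []
    for _ in range(iters):
        # Membership update: u_ik is proportional to d_ik^(-2 / (m - 1)).
        u = []
        for p in points:
            d2 = [sum((p[d] - ctr[d]) ** 2 for d in range(dim)) + 1e-12
                  for ctr in centers]
            inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
            total = sum(inv)
            u.append([v / total for v in inv])
        # Center update: mean of the points weighted by membership^m.
        new_centers = []
        for k in range(c):
            w = [row[k] ** m for row in u]
            wsum = sum(w)
            new_centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / wsum
                 for d in range(dim)])
        shift = max(abs(a - b)
                    for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            break
    return centers, u


def two_stage_features(genes, c1=2, c2=2, m=2.0):
    """Cluster genes, then re-cluster the genes inside each first-stage cluster.

    Returns the second-stage cluster centers; each center holds one value
    per sample and is treated here as one reduced feature (an assumption).
    """
    _, u1 = fuzzy_c_means(genes, c1, m)
    # Assumption: harden stage-one memberships by taking the largest one.
    groups = [[] for _ in range(c1)]
    for gene, row in zip(genes, u1):
        groups[row.index(max(row))].append(gene)
    features = []
    for group in groups:
        if len(group) < c2:
            continue  # too few genes to split further
        centers, _ = fuzzy_c_means(group, c2, m)
        features.extend(centers)
    return features


# Toy expression matrix (made-up values): 8 genes (rows) x 4 samples (columns).
genes = [
    [9.0, 9.1, 1.0, 1.2], [8.8, 9.3, 0.9, 1.1],
    [9.2, 8.9, 1.1, 0.8], [7.0, 7.2, 2.0, 2.1],
    [1.0, 1.1, 9.0, 9.2], [0.9, 1.3, 8.8, 9.1],
    [1.2, 0.8, 9.3, 8.9], [2.0, 2.2, 7.1, 7.0],
]
features = two_stage_features(genes)
print(len(features), "reduced features, each with", len(features[0]), "sample values")
```

In a full pipeline the reduced feature vectors would then be transposed into a samples-by-features matrix and passed to a support vector machine with a chosen kernel; that step is omitted here because the abstract gives no details on the kernels or their parameters.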