Evolving clusters in gene-expression data

  • Authors: Eduardo R. Hruschka; Ricardo J. G. B. Campello; Leandro N. de Castro

  • Affiliations: Graduate Program in Computer Science, Catholic University of Santos (UniSantos), Rua Dr. Carvalho de Mendonça 144, CEP 11.070-906 Santos, SP, Brazil; Graduate Program in Computer Science, Catholic University of Santos (UniSantos), Rua Dr. Carvalho de Mendonça 144, CEP 11.070-906 Santos, SP, Brazil and FEEC/UNICAMP, CP 6101, CEP 13.083-970 ...; Graduate Program in Computer Science, Catholic University of Santos (UniSantos), Rua Dr. Carvalho de Mendonça 144, CEP 11.070-906 Santos, SP, Brazil and FEEC/UNICAMP, CP 6101, CEP 13.083-970 ...

  • Venue: Information Sciences: an International Journal
  • Year: 2006

Quantified Score

Hi-index 0.07

Abstract

Clustering is a useful exploratory tool for gene-expression data. Although successful applications of clustering techniques have been reported in the literature, the gene-expression analysis community has no method of choice. Moreover, only a few works address the problem of automatically estimating the number of clusters in bioinformatics datasets. Most clustering methods require the number k of clusters to be either specified in advance or selected a posteriori from a set of clustering solutions over a range of k. In both cases, the user has to select the number of clusters. This paper proposes improvements to a clustering genetic algorithm that is capable of automatically discovering an optimal number of clusters and the corresponding optimal partition based upon numeric criteria. The proposed improvements are mainly designed to enhance the efficiency of the original clustering genetic algorithm, resulting in two new clustering genetic algorithms and an evolutionary algorithm for clustering (EAC). The original clustering genetic algorithm and its modified versions are evaluated in several runs using six gene-expression datasets in which the correct clusters are known a priori. The results illustrate that all the proposed algorithms perform well on gene-expression data, although statistical comparisons of the computational efficiency of each algorithm indicate that EAC outperforms the others. Statistical evidence also shows that EAC is able to outperform a traditional method based on multiple runs of k-means over a range of k.
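
For readers unfamiliar with the baseline mentioned at the end of the abstract, the following is a minimal sketch of "multiple runs of k-means over a range of k". It is not the authors' EAC or their experimental setup. The abstract says only that partitions are chosen "based upon numeric criteria", so the use of the average silhouette width as the selection criterion is an assumption here, as are the function names (`kmeans`, `silhouette`, `best_partition`), the number of runs, and the toy data.

```python
import math
import random


def kmeans(points, k, rng, max_iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Average silhouette width (assumed criterion; higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as worst case
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_partition(points, k_range, runs=10, seed=0):
    """Run k-means `runs` times for each k and keep the best-scoring partition."""
    rng = random.Random(seed)
    best = (-2.0, None, None)  # (score, k, labels)
    for k in k_range:
        for _ in range(runs):
            labels = kmeans(points, k, rng)
            score = silhouette(points, labels)
            if score > best[0]:
                best = (score, k, labels)
    return best


# Toy data (hypothetical): three tight, well-separated groups in 2-D.
offsets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

score, k, labels = best_partition(points, range(2, 7))
print(k, round(score, 3))
```

The cost of this baseline grows with both the number of candidate values of k and the number of restarts per k, which is the kind of overhead an algorithm that searches over k and the partition simultaneously aims to avoid.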