eCCV: a new fuzzy cluster validity measure for large relational bioinformatics datasets

  • Authors:
  • Mihail Popescu; James C. Bezdek; James M. Keller

  • Affiliations:
  • Health Management and Medical Informatics Department, U. of Missouri, Columbia, MO; Electrical and Computer Engineering Department, U. of Missouri, Columbia, MO; Electrical and Computer Engineering Department, U. of Missouri, Columbia, MO

  • Venue:
  • FUZZ-IEEE'09 Proceedings of the 18th international conference on Fuzzy Systems
  • Year:
  • 2009

Abstract

The existence of the BLAST sequence comparison algorithm and of microarray technology is among the reasons that bioinformatics is the domain with the most abundant large relational datasets. For example, BLAST-ing the genes of the human genome (around 30,000 genes) yields a 30,000 by 30,000 distance matrix. This matrix cannot currently be stored in the memory of a typical desktop PC. At the same time, clustering the resulting matrix with a fuzzy relational clustering algorithm such as Non-Euclidean Relational Fuzzy C-Means (NERFCM) requires prior knowledge of the number of clusters present in the dataset. The question is: how can we estimate the number of clusters if we cannot even load the matrix into the memory of our PC? To address this problem, we propose an extension of the correlation cluster validity (CCV) measure that we introduced in a previous paper, and we denote the new validity measure eCCV. eCCV consists of two steps: first, sampling the large matrix; second, estimating the number of clusters by applying CCV to the sampled data. The sampling strategy also produces a significant processing speedup. We illustrate the properties of eCCV on a large synthetic dataset and on a large subset of human genes obtained from the RefSeq database.
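
To make the memory argument and the sampling step concrete, here is a minimal Python sketch. It is not the paper's implementation. The abstract does not specify the sampling scheme, so uniform random sampling of objects is an assumption. The 8-byte entry size, the sample size of 300, and the names `full_matrix_bytes`, `sample_indices`, `sampled_submatrix`, and `toy_dist` are likewise hypothetical, and `toy_dist` merely stands in for a BLAST-derived distance.

```python
import random

def full_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold a dense n x n matrix (8-byte entries assumed)."""
    return n * n * bytes_per_entry

def sample_indices(n, m, seed=0):
    """Draw m distinct object indices uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), m))

def sampled_submatrix(dist, idx):
    """Build the m x m submatrix, evaluating dist only on sampled pairs."""
    return [[dist(i, j) for j in idx] for i in idx]

def toy_dist(i, j):
    """Hypothetical placeholder for a pairwise sequence distance."""
    return abs(i - j) / 30000.0

n, m = 30000, 300
idx = sample_indices(n, m)
sub = sampled_submatrix(toy_dist, idx)

print(full_matrix_bytes(n) / 1e9)  # 7.2  (GB for the full matrix)
print(full_matrix_bytes(m) / 1e6)  # 0.72 (MB for the sampled matrix)
```

Under these assumptions, the full matrix needs about 7.2 GB while the sampled one needs under 1 MB, so a validity measure such as CCV could be evaluated on `sub` for each candidate number of clusters without ever materializing the full matrix.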