Distance functions, clustering algorithms and microarray data analysis

Authors:
Raffaele Giancarlo;Giosuè Lo Bosco;Luca Pinello
Affiliations:
Dipartimento di Matematica ed Informatica, Università di Palermo, Italy;Dipartimento di Matematica ed Informatica, Università di Palermo, Italy;Dipartimento di Matematica ed Informatica, Università di Palermo, Italy
Venue:
LION'10 Proceedings of the 4th international conference on Learning and intelligent optimization
Year:
2010



Abstract

Distance functions are a fundamental ingredient of classification and clustering procedures, and this also holds for microarray data. In the general data mining and classification literature, functions such as Euclidean distance and Pearson correlation have become de facto standards thanks to a considerable amount of experimental validation. For microarray data, the question of which distance function "works best" has been investigated, but no final conclusion has been reached. The aim of this paper is to shed further light on that issue. We present an experimental study, involving several distances, that assesses (a) their intrinsic separation ability and (b) their predictive power when used in conjunction with clustering algorithms. The experiments were carried out on six benchmark microarray datasets, for each of which the "gold solution" is known. We used both Hierarchical and K-means clustering algorithms, with external validation criteria as evaluation tools. From the methodological point of view, the main result of this study is a ranking of those measures in terms of their intrinsic and clustering abilities, which also highlights the correlations between the two. Pragmatically, the outcomes of the experiments indicate that the Minkowski, cosine and Pearson correlation distances seem to be the best choices for microarray data analysis.
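To make the three distances named above concrete, the following is a minimal sketch using their common textbook definitions. The abstract does not give the exact formulations used in the study, so the function names, the default order p=2, the "one minus similarity" convention for cosine and Pearson, and the toy expression profiles are all assumptions made for illustration.

```python
import math


def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y.

    Assumes neither vector is all zeros (otherwise the norm is zero).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def pearson_distance(x, y):
    """One minus the Pearson correlation coefficient of x and y.

    Pearson correlation is the cosine similarity of the mean-centered
    vectors, so the centered profiles are passed to cosine_distance.
    Assumes neither profile is constant.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)


# Hypothetical expression profiles of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, twice the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend to gene_a

if __name__ == "__main__":
    for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
        print(name,
              minkowski_distance(gene_a, other),
              cosine_distance(gene_a, other),
              pearson_distance(gene_a, other))
```

With these definitions, the cosine and Pearson distances treat gene_a and gene_b as essentially identical because they ignore amplitude, whereas the Minkowski distance does not. This is one reason the choice of distance can change which profiles a clustering algorithm groups together.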