Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis

Authors:
Pablo A. Jaskowiak;Ricardo J. G. B. Campello;Ivan G. Costa Filho
Affiliations:
University of São Paulo, São Carlos;University of São Paulo, São Carlos;Federal University of Pernambuco, Recife and Aachen University Medical School, RWTH Aachen
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2013

Abstract

Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is crucial to achieving satisfactory clustering results. Nevertheless, to date there are no comprehensive guidelines on how to choose proximity measures for clustering microarray data. Pearson correlation is the most widely used proximity measure, whereas the characteristics of other measures remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time-course and cancer experiments. Our results indicate that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out in the time-course and cancer data evaluations, the choice of measure should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature into a benchmark and developed a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
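
For readers unfamiliar with the three commonly employed measures named in the abstract, the following is a minimal sketch of how pairwise Pearson, Spearman, and Euclidean proximities between expression profiles might be computed. It is not the authors' code. The function names, the toy `profiles` matrix, and the conversion of a correlation r into a dissimilarity as 1 - r are assumptions made for illustration, and the Spearman ranking here does not average tied values.

```python
import numpy as np


def pearson_distance(X):
    """Pairwise 1 - Pearson correlation between the rows of X (assumed convention)."""
    return 1.0 - np.corrcoef(X)


def spearman_distance(X):
    """Pairwise 1 - Spearman correlation: Pearson applied to within-row ranks.

    Simplification: ties receive distinct ranks instead of averaged ranks.
    """
    ranks = np.argsort(np.argsort(X, axis=1), axis=1).astype(float)
    return 1.0 - np.corrcoef(ranks)


def euclidean_distance(X):
    """Pairwise Euclidean distance between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy data: 3 genes (rows) measured at 5 time points (columns).
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # rising profile
    [2.0, 4.0, 6.0, 8.0, 10.0],  # same shape, larger magnitude
    [5.0, 4.0, 3.0, 2.0, 1.0],   # falling profile
])

d_pearson = pearson_distance(profiles)
d_spearman = spearman_distance(profiles)
d_euclid = euclidean_distance(profiles)
```

In this toy example the first two genes share the same shape, so both correlation-based distances between them are near zero while their Euclidean distance is not. This is one way the choice of measure can change which genes appear close before any clustering algorithm is run.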