Parallel Selection of Informative Genes for Classification

Authors:
Michael Slavik;Xingquan Zhu;Imad Mahgoub;Muhammad Shoaib
Affiliations:
Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA;Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA;Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA;Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA
Venue:
BICoB '09 Proceedings of the 1st International Conference on Bioinformatics and Computational Biology
Year:
2009

Abstract

In this paper, we argue that existing gene selection methods are not effective at selecting important genes when the number of samples and the data dimensionality grow sufficiently large. As a solution, we propose two approaches to parallel gene selection, both based on the well-known ReliefF feature selection method. In the first design, denoted PReliefF_p, the input data are split into non-overlapping subsets that are assigned to cluster nodes. Each node carries out gene selection by running ReliefF on its own subset, without interacting with the other nodes. The final ranking of the genes is generated by gathering the weight vectors from all nodes. In the second design, PReliefF_g, each node dynamically updates the global weight vectors, so that the gene selection results of one node can be used to boost the selection on the other nodes. Experimental results on real-world microarray expression data show that PReliefF_p and PReliefF_g achieve a speedup factor nearly equal to the number of nodes. When combined with several popular classification methods, classifiers built from the genes selected by either method achieve the same or even better accuracy than classifiers built from the genes selected by the original ReliefF method.
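
The first design described in the abstract (split the samples, score genes independently per node, then gather the weight vectors) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the scorer is a simplified two-class Relief with a single nearest hit and nearest miss rather than the full ReliefF, the "nodes" are simulated by a sequential loop, the subsets are contiguous chunks, and the gathered weight vectors are combined by a plain average. All function names (`relief_weights`, `partition_indices`, `prelieff_p_sketch`, `make_toy_data`) are hypothetical and exist only for this illustration.

```python
import random


def relief_weights(X, y, n_iter, rng):
    """Simplified Relief: one nearest hit and one nearest miss, two classes.

    Assumption: this stands in for ReliefF, which uses k neighbours and
    handles multiple classes.
    """
    n_features = len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(n_features))

    w = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [k for k in range(len(X)) if k != i and y[k] == y[i]]
        misses = [k for k in range(len(X)) if y[k] != y[i]]
        if not hits or not misses:
            continue  # this subset cannot supply both neighbours
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(n_features):
            # Reward features that separate classes, penalise those that
            # differ within a class.
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n_iter
    return w


def partition_indices(n_samples, n_nodes):
    """Split sample indices into non-overlapping contiguous chunks."""
    bounds = [k * n_samples // n_nodes for k in range(n_nodes + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(n_nodes)]


def prelieff_p_sketch(X, y, n_nodes, n_iter=50, seed=0):
    """Score features per subset, then gather and average the weights."""
    node_weights = []
    # Each loop iteration is independent of the others, so on a real
    # cluster it could be dispatched to a separate node.
    for node, idx in enumerate(partition_indices(len(X), n_nodes)):
        rng = random.Random(seed + node)
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        node_weights.append(relief_weights(Xs, ys, n_iter, rng))
    n_features = len(X[0])
    # Assumption: the gather step averages the per-node weight vectors.
    combined = [sum(w[j] for w in node_weights) / n_nodes
                for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: combined[j],
                     reverse=True)
    return combined, ranking


def make_toy_data(n_samples=80, seed=1):
    """Toy data: feature 0 tracks the class label, features 1-2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([label + rng.gauss(0, 0.1), rng.random(), rng.random()])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    weights, ranking = prelieff_p_sketch(X, y, n_nodes=4)
    print("combined weights:", weights)
    print("feature ranking:", ranking)
```

Because no node reads another node's data or weights, the per-node work in this sketch shrinks roughly in proportion to the number of subsets. The second design, in which nodes update shared global weights while running, would need a communication step that is not shown here.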