A simulation to analyze feature selection methods utilizing gene ontology for gene expression classification

Authors:
Christopher E. Gillies;Mohammad-Reza Siadat;Nilesh V. Patel;George D. Wilson
Affiliations:
-;-;-;-
Venue:
Journal of Biomedical Informatics
Year:
2013

Citing 11
Cited 0

Approximation algorithms

Approximation algorithms
An Information-Theoretic Definition of Similarity

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Correlation between Gene Expression and GO Semantic Similarity

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Integrating gene ontology into discriminative powers of genes for feature selection in microarray data

Proceedings of the 2007 ACM symposium on Applied computing
A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
A review of feature selection techniques in bioinformatics

Bioinformatics
Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Using information content to evaluate semantic similarity in a taxonomy

IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1
SoFoCles: Feature filtering for microarray classification based on Gene Ontology

Journal of Biomedical Informatics
A Multiple-Filter-Multiple-Wrapper Approach to Gene Selection and Microarray Data Classification

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
ICGA-PSO-ELM Approach for Accurate Multiclass Cancer Classification Resulting in Reduced Gene Sets in Which Genes Encoding Secreted Proteins Are Highly Represented

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have been experimenting with the use of semantic similarity between genes in Gene Ontology (GO) as a method to improve feature selection. While there are few studies that discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation, which generates binary class datasets, where the differentially expressed genes between two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by @d, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor for determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level BCAL(G), where G is a graph of differentially expressed genes. Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G)=0.696; (2) it decreases classification accuracy when BCAL(G)==0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has less than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where @d is between 0.3 and 2.5, and training sample sizes between 20 and 200, therefore our conclusions are limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical-based ranking measures.