A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data

Authors:
Gianluca Bontempi
Affiliations:
-
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2007

Abstract

Because of the high dimensionality of microarray gene expression data sets, machine learning algorithms typically rely on feature selection techniques to classify them effectively. However, the large number of features relative to the number of samples makes feature selection computationally hard and prone to errors. This paper interprets feature selection as a stochastic optimization task, where the goal is to select, among an exponential number of alternative gene subsets, the one expected to return the highest generalization accuracy in classification. Blocking is an experimental design strategy that produces similar experimental conditions when comparing alternative stochastic configurations, so that observed differences in accuracy can be attributed with confidence to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection, which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a single learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the number of experimental conditions under which a feature subset is validated, we can lessen the problems related to the scarcity of samples and consequently arrive at a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection on a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that the improvements occur independently of the classification algorithm used after the selection step.
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation retrieves the PubMed abstracts associated with the selected genes and matches them against regular expressions describing the biological phenomenon underlying the expression data sets. The second validation uses the Bioconductor package GOstats to perform a Gene Ontology statistical analysis.
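The blocking idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the three toy learners, the leave-one-out validation, the synthetic data, and the names `blocked_outcomes`, `paired_gain`, and `forward_selection` are all assumptions made for illustration. The sketch evaluates every candidate gene subset on the same held-out samples with the same learners, so that the per-sample outcomes of two subsets can be compared pair by pair.

```python
import random
from statistics import mean

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid(train_X, train_y, x):
    # Predict the class whose mean vector is closest to x.
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(centroids, key=lambda label: sq_dist(x, centroids[label]))

def knn(k):
    # Build a k-nearest-neighbour learner with majority voting.
    def predict(train_X, train_y, x):
        nearest = sorted(zip((sq_dist(x, r) for r in train_X), train_y))[:k]
        votes = [y for _, y in nearest]
        return max(sorted(set(votes)), key=votes.count)
    return predict

# Hypothetical pool of learners; the paper uses its own set of classifiers.
LEARNERS = [nearest_centroid, knn(1), knn(3)]

def blocked_outcomes(X, y, subset, learners=LEARNERS):
    """0/1 leave-one-out errors for every (learner, sample) pair.

    The order of the outcomes is the same for every subset, which is
    what allows two subsets to be compared in a paired way.
    """
    Z = [[row[g] for g in subset] for row in X]
    outcomes = []
    for learner in learners:
        for i in range(len(Z)):
            train_X = Z[:i] + Z[i + 1:]
            train_y = y[:i] + y[i + 1:]
            outcomes.append(int(learner(train_X, train_y, Z[i]) != y[i]))
    return outcomes

def paired_gain(candidate, incumbent):
    # Mean paired difference in error; negative means the candidate is better.
    return mean(c - b for c, b in zip(candidate, incumbent))

def forward_selection(X, y, n_genes):
    selected, remaining = [], list(range(len(X[0])))
    for _ in range(n_genes):
        best_gene, best_out = None, None
        for g in remaining:
            out = blocked_outcomes(X, y, selected + [g])
            if best_out is None or paired_gain(out, best_out) < 0:
                best_gene, best_out = g, out
        selected.append(best_gene)
        remaining.remove(best_gene)
    return selected

# Synthetic data (an assumption): gene 0 separates the classes, genes 1-5 are noise.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[5.0 * label + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(5)]
     for label in y]

selected = forward_selection(X, y, n_genes=2)
print(selected)
```

Here each subset is scored on 3 learners times 20 held-out samples, giving 60 paired outcomes instead of the 20 a single-learner wrapper would provide. A fuller treatment would replace the simple mean paired difference with a paired statistical test.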