A new test system for stability measurement of marker gene selection in DNA microarray data analysis

Authors:
Fei Xiong;Heng Huang;James Ford;Fillia S. Makedon;Justin D. Pearlman
Affiliations:
Department of Computer Science, Dartmouth College, Hanover, NH;Department of Computer Science, Dartmouth College, Hanover, NH;Department of Computer Science, Dartmouth College, Hanover, NH;Department of Computer Science, Dartmouth College, Hanover, NH;Advanced Imaging Center, Dartmouth-Hitchcock Medical Center, Lebanon, NH
Venue:
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Year:
2005

Abstract

Microarray gene expression data contain informative features that reflect the critical processes controlling prominent biological functions. Feature selection algorithms have been used in previous biomedical research to find the “marker” genes whose changes in expression value correspond to the most prominent differences between specimen classes. One problem encountered in such analysis is the imbalance between the very large number of genes and the relatively small number of specimen samples. A common concern, therefore, is “overfitting” the data and deriving a set of marker genes with low stability over the entire set of possible specimens. To address this problem, we propose a new test environment in which synthetic data are perturbed to simulate possible variations in gene expression values. The goal is for the generated data to have properties that match natural data and that make them suitable for testing the sensitivity of feature selection algorithms and validating the robustness of selected marker genes. In this paper, we evaluate a statistically based resampling approach and a Principal Components Analysis (PCA)-based linear noise distribution approach. Our results show that both methods generate reasonable synthetic data and that the signal-to-noise ratio (with variation weights of 5%, 10%, 20% and 30%) measurably affects the classification accuracy and the marker genes selected. Based on these results, we identify the most appropriate marker gene selection and classification techniques for each type and level of noise we modeled.
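
The abstract does not spell out either perturbation procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the resampling approach draws noise from each gene's own centered values, that the PCA-based approach adds Gaussian noise along the top principal components, and that stability is measured as the Jaccard overlap between the marker sets chosen before and after perturbation. All function names (`resample_perturb`, `pca_perturb`, `top_genes`, `jaccard`), the mean-difference filter, and the toy data are hypothetical.

```python
import numpy as np

def resample_perturb(X, weight, rng):
    """Assumed resampling scheme: add noise drawn, with replacement,
    from each gene's own centered expression values.
    X is (n_samples, n_genes); weight is the noise fraction, e.g. 0.10."""
    centered = X - X.mean(axis=0)
    n_samples, n_genes = X.shape
    # Independent row indices for every (sample, gene) cell.
    idx = rng.integers(0, n_samples, size=(n_samples, n_genes))
    noise = np.take_along_axis(centered, idx, axis=0)
    return X + weight * noise

def pca_perturb(X, weight, rng, n_components=3):
    """Assumed PCA scheme: add Gaussian noise restricted to the span
    of the top principal components of X."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    # Standard deviation of the sample scores on each component.
    score_std = s[:k] / np.sqrt(max(X.shape[0] - 1, 1))
    scores = rng.normal(size=(X.shape[0], k)) * score_std
    return X + (weight * scores) @ Vt[:k]

def top_genes(X, y, k):
    """Hypothetical filter: keep the k genes with the largest absolute
    difference between the two class means."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return set(np.argsort(diff)[-k:].tolist())

def jaccard(a, b):
    """Overlap between two gene sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

# Toy data: 20 specimens, 50 genes, the first 5 genes shifted in class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))
X[y == 1, :5] += 3.0
baseline = top_genes(X, y, 5)

# Variation weights taken from the abstract.
for w in (0.05, 0.10, 0.20, 0.30):
    for name, perturb in (("resample", resample_perturb), ("pca", pca_perturb)):
        selected = top_genes(perturb(X, w, rng), y, 5)
        print(f"{name:8s} weight={w:.2f} stability={jaccard(baseline, selected):.2f}")
```

A stability score near 1 at a given weight would suggest that the selected markers are robust to that level of noise under these assumptions; any real evaluation would substitute the feature selection and classification methods actually under test.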