Performance of feature-selection methods in the classification of high-dimension data

Authors:
Jianping Hua; Waibhav D. Tembe; Edward R. Dougherty
Affiliations:
Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA; High Performance Bio-Computing Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA; Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA and Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 7 ...
Venue:
Pattern Recognition
Year:
2009

Abstract

Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or restricted their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performance of feature-selection algorithms for different distribution models and classifiers. Both the classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test perform better than, or similarly to, wrapper methods on harder problems. This improved performance is usually accompanied by significant peaking. Wrapper methods perform better when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, performs worse than univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.
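As a rough illustration of the kind of experiment the abstract describes, the sketch below ranks features with a univariate t-test filter on synthetic data and counts how many markers end up among the selected features. It is not the authors' distribution model or code: the independent unit-variance Gaussian features, the mean shift, the Welch form of the t-statistic, and the names `make_synthetic`, `t_scores`, and `select_top_k` are all assumptions made for this example.

```python
import math
import random
import statistics


def make_synthetic(n_per_class, n_markers, n_noise, shift=1.0, seed=0):
    """Two-class Gaussian data (an assumed toy model, not the paper's).

    The first n_markers features have mean `shift` in class 1 and mean 0 in
    class 0; the remaining n_noise features are pure noise in both classes.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(shift * label, 1.0) for _ in range(n_markers)]
            row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            X.append(row)
            y.append(label)
    return X, y


def t_scores(X, y):
    """Absolute Welch t-statistic of each feature between the two classes.

    Assumes each feature has nonzero variance in at least one class.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / se)
    return scores


def select_top_k(scores, k):
    """Indices of the k features with the largest scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Small-sample, higher-dimensional setting: 60 samples, 500 features,
# of which only the first 10 are markers.
n_markers = 10
X, y = make_synthetic(n_per_class=30, n_markers=n_markers, n_noise=490, shift=1.5)
selected = select_top_k(t_scores(X, y), k=n_markers)
found = sum(1 for j in selected if j < n_markers)
print(f"markers recovered: {found} of {n_markers}")
```

Shrinking `n_per_class` or `shift` in this toy setup makes the problem harder and typically lowers the number of recovered markers, which is the quantity the study reports alongside classification error.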