Information-theoretic approaches to SVM feature selection for metagenome read classification

Authors:
Elaine Garbarine; Joseph DePasquale; Vinay Gadia; Robi Polikar; Gail Rosen
Affiliations:
Electrical and Computer Engineering Department, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA; Electrical and Computer Engineering Department, Rowan University, 201 Mullica Hill Rd., Glassboro, NJ 08028, USA; Electrical and Computer Engineering Department, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA; Electrical and Computer Engineering Department, Rowan University, 201 Mullica Hill Rd., Glassboro, NJ 08028, USA; Electrical and Computer Engineering Department, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA
Venue:
Computational Biology and Chemistry
Year:
2011

Abstract

Metagenomics, the analysis of DNA sequences isolated directly from the environment, produces a large quantity of genome fragments that must be classified into specific taxa. Most composition-based classification methods use all available features rather than a subset that may maximize classifier accuracy. We show that feature selection can boost the performance of taxonomic classifiers. This work proposes three filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information; (2) a text-mining technique, TF-IDF; and (3) minimum-redundancy maximum-relevance (mRMR). The methods are compared by how much they improve support vector machine classification of genomic reads. Overall, mRMR on 6-mers performs well, especially at the phylum level. When the total number of features is very large, feature selection becomes difficult because a small subset of features that captures most of the data variance is less likely to exist. We therefore conclude that classification performance depends on a trade-off between feature set size and feature selection method. For larger feature set sizes, TF-IDF works better at finer taxonomic resolutions, while for N=6 mRMR performs best of all methods at every taxonomic level.
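
To make the idea of filter-based selection over N-mer features concrete, the following is a minimal sketch of a TF-IDF-style ranking of k-mers. It is not the authors' formulation: treating each taxon's pooled reads as one "document", the unsmoothed logarithmic IDF, and the function names `kmer_counts`, `tfidf_scores`, and `select_features` are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_counts(read, k):
    """Count overlapping k-mers in a read, skipping k-mers with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k].upper()
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def tfidf_scores(reads_by_taxon, k):
    """Score (taxon, k-mer) pairs; each taxon's pooled reads act as one document.

    Assumption: TF is the relative k-mer frequency within a taxon and
    IDF is log(number of taxa / number of taxa containing the k-mer).
    """
    per_taxon = {}
    for taxon, reads in reads_by_taxon.items():
        total = Counter()
        for read in reads:
            total.update(kmer_counts(read, k))
        per_taxon[taxon] = total

    n_taxa = len(per_taxon)
    doc_freq = Counter()
    for counts in per_taxon.values():
        doc_freq.update(counts.keys())

    scores = {}
    for taxon, counts in per_taxon.items():
        n_kmers = sum(counts.values())
        scores[taxon] = {
            kmer: (c / n_kmers) * math.log(n_taxa / doc_freq[kmer])
            for kmer, c in counts.items()
        }
    return scores


def select_features(scores, n_features):
    """Keep the n_features k-mers with the highest score in any taxon."""
    best = {}
    for taxon_scores in scores.values():
        for kmer, s in taxon_scores.items():
            best[kmer] = max(best.get(kmer, 0.0), s)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(best, key=lambda kmer: (-best[kmer], kmer))
    return ranked[:n_features]


# Toy reads (hypothetical data, far shorter than real metagenomic reads).
toy_reads = {"taxon_a": ["AAAAAC", "GT"], "taxon_b": ["CCCCCA", "GT"]}
selected = select_features(tfidf_scores(toy_reads, k=2), n_features=2)
```

In this sketch a k-mer that occurs in every taxon receives a score of zero and is ranked last, which is the filtering behavior TF-IDF is meant to provide. The selected k-mers would then define the reduced feature vector passed to a classifier such as an SVM.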