Conquering the curse of dimensionality in gene expression cancer diagnosis: tough problem, simple models

Authors:
Minca Mramor;Gregor Leban;Janez Demšar;Blaž Zupan
Affiliations:
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia;Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia;Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia;Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Venue:
AIME'05 Proceedings of the 10th conference on Artificial Intelligence in Medicine
Year:
2005

Abstract

In this paper, we study the properties of cancer gene expression data sets from the perspective of classification and tumor diagnosis. Our findings and case studies are based on several recently published data sets. We find that these data sets typically include a subset of about 100 highly discriminating features, whose predictive power can be further enhanced by exploring their interactions. This finding speaks against the often-used univariate feature selection methods and may explain the superior performance of support vector machines recently reported in related work. We argue that a much simpler technique, one that directly finds visualizations with clear separation of diagnostic classes, may be used instead. Furthermore, it may perform better at inferring an understandable classifier that includes only a few relevant features.
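
To illustrate the claim about feature interactions, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes a toy data set with three hypothetical features (g1 and g2 interact in an XOR-like way, g3 is irrelevant), a signal-to-noise style univariate score, and leave-one-out k-nearest-neighbor accuracy as a stand-in measure of how well a two-feature projection separates the classes. All function names and the choice of k = 3 are assumptions made for this example.

```python
from itertools import combinations
from statistics import mean, pstdev


def make_toy_data():
    """36 samples, 3 features: g1 and g2 interact (XOR-like), g3 is irrelevant."""
    X, y = [], []
    for sx in (1, -1):
        for sy in (1, -1):
            for a in (1, 2, 3):
                for b in (1, 2, 3):
                    X.append((sx * a, sy * b, (a + b) % 3))
                    # Class depends only on the sign combination of g1 and g2.
                    y.append(0 if sx * sy > 0 else 1)
    return X, y


def univariate_score(X, y, f):
    """Signal-to-noise style score of one feature: |mu0 - mu1| / (sd0 + sd1)."""
    v0 = [row[f] for row, c in zip(X, y) if c == 0]
    v1 = [row[f] for row, c in zip(X, y) if c == 1]
    return abs(mean(v0) - mean(v1)) / (pstdev(v0) + pstdev(v1))


def projection_score(X, y, feats, k=3):
    """Leave-one-out k-NN accuracy using only the features in `feats`."""
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        # Squared Euclidean distance in the projection; list.sort is stable.
        others.sort(key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats))
        votes = [y[j] for j in others[:k]]
        predicted = max(set(votes), key=votes.count)
        correct += predicted == y[i]
    return correct / len(X)


X, y = make_toy_data()
# Score every feature on its own, then every pair of features jointly.
single = {f: univariate_score(X, y, f) for f in range(3)}
pairs = {p: projection_score(X, y, p) for p in combinations(range(3), 2)}
best_pair = max(pairs, key=pairs.get)
print("univariate scores:", single)
print("pair scores:", pairs, "best:", best_pair)
```

On this toy data every feature receives a univariate score of zero, so a univariate filter cannot tell the two interacting features from the irrelevant one, while the joint projection onto the interacting pair separates the classes perfectly. Real expression data are noisier, and the scoring function actually used for ranking projections may differ from the one assumed here.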