Selection of relevant genes in cancer diagnosis based on their prediction accuracy

Authors:
Rosalia Maglietta;Annarita D'Addabbo;Ada Piepoli;Francesco Perri;Sabino Liuni;Graziano Pesole;Nicola Ancona
Affiliations:
Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR Via Amendola 122/D-I, 70126 Bari, Italy;Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR Via Amendola 122/D-I, 70126 Bari, Italy;Unità Operativa di Gastroenterologia, IRCCS, "Casa Sollievo della Sofferenza"-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG), Italy;Unità Operativa di Gastroenterologia, IRCCS, "Casa Sollievo della Sofferenza"-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG), Italy;Istituto di Tecnologie Biomediche, Sede di Bari, CNR Via Amendola 122/D, 70126 Bari, Italy;Istituto di Tecnologie Biomediche, Sede di Bari, CNR Via Amendola 122/D, 70126 Bari, Italy and Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Via E. Orabona 4, 70126 Ba ...;Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR Via Amendola 122/D-I, 70126 Bari, Italy
Venue:
Artificial Intelligence in Medicine
Year:
2007


Abstract

Motivations: One of the main problems in cancer diagnosis using DNA microarray data is selecting the genes relevant to the pathology by analyzing their expression profiles in tissues under two different phenotypic conditions. The question we pose is the following: how do we measure the relevance of a single gene in a given pathology?

Methods: A gene is relevant for a particular disease if we are able to correctly predict the occurrence of the pathology in new patients on the basis of its expression level only. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier able to generalize, that is, able to correctly predict the status of new patients. In this paper we present a selection-bias-free, statistically well-founded method for finding relevant genes on the basis of their classification ability.

Results: We applied the method to a colon cancer data set and produced a list of relevant genes, ranked on the basis of their prediction accuracy. Out of more than 6500 available genes, we found 54 overexpressed in normal tissues and 77 overexpressed in tumor tissues with prediction accuracy greater than 70% and p-value ≤ 0.05.

Conclusions: The relevance of the selected genes was assessed (a) statistically, by evaluating the p-value of the estimated prediction accuracy of each gene; (b) biologically, by confirming the involvement of many genes in generic carcinogenic processes and, in particular, in those of the colon; (c) comparatively, by verifying the presence of these genes in other studies on the same data set.
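
The idea of scoring one gene by its prediction accuracy and then attaching a p-value to that score can be illustrated with a minimal sketch. The abstract does not say which classifier, error-estimation protocol, or significance test the authors used, so everything below is an assumption chosen for illustration: a midpoint-threshold classifier, leave-one-out cross-validation, and a label-permutation test. The function names (`loo_accuracy`, `permutation_p_value`) and the toy expression values are hypothetical and are not taken from the paper or its data.

```python
import random
from statistics import mean


def loo_accuracy(values, labels):
    """Leave-one-out accuracy of a midpoint-threshold classifier on one gene.

    ASSUMPTION: this simple classifier is a stand-in, not the paper's method.
    values: expression levels of a single gene, one per tissue sample.
    labels: 0 (e.g. normal) or 1 (e.g. tumor), one per sample.
    """
    n = len(values)
    correct = 0
    for i in range(n):
        # Train on every sample except the held-out one.
        train = [(v, y) for j, (v, y) in enumerate(zip(values, labels)) if j != i]
        pos = [v for v, y in train if y == 1]
        neg = [v for v, y in train if y == 0]
        if not pos or not neg:
            continue  # no threshold can be fitted; the sample counts as an error
        mu_pos, mu_neg = mean(pos), mean(neg)
        threshold = (mu_pos + mu_neg) / 2.0
        # Assign the held-out sample to the class whose mean lies on its side.
        predicted = 1 if (values[i] > threshold) == (mu_pos > mu_neg) else 0
        correct += predicted == labels[i]
    return correct / n


def permutation_p_value(values, labels, n_permutations=200, seed=0):
    """Estimate how often shuffled labels reach the observed accuracy."""
    rng = random.Random(seed)
    observed = loo_accuracy(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loo_accuracy(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data (hypothetical): 10 "normal" and 10 "tumor" samples for one gene.
    labels = [0] * 10 + [1] * 10
    gene = [float(v) for v in range(10)] + [float(v) for v in range(100, 110)]
    accuracy, p_value = permutation_p_value(gene, labels)
    print(f"accuracy={accuracy:.2f}, p-value={p_value:.4f}")
```

Repeating this for every gene and keeping those that pass an accuracy threshold and a significance level would give a ranked list in the spirit of the one described in the Results, although the numbers it produces would not be comparable with the paper's.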