A rank sum test method for informative gene discovery

Authors:
Lin Deng; Jian Pei; Jinwen Ma; Dik Lun Lee
Affiliations:
Hong Kong University of Science and Technology, Hong Kong, China; State University of New York at Buffalo, NY and Simon Fraser University, Canada; Peking University, Beijing, China; Hong Kong University of Science and Technology, Hong Kong, China
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004

Abstract

Finding informative genes from microarray data is an important problem in bioinformatics research and applications. Most existing methods rank features according to their discriminative capability and then select a subset of discriminative genes (usually the top k genes). In particular, the t-statistic criterion and its variants have been adopted extensively. Methods of this kind rely on the statistical principle of the t-test, which requires that the data follow a normal distribution. However, according to our investigation, the normality condition often cannot be met in real data sets.

To avoid assuming normality, in this paper we propose a rank sum test method for informative gene discovery. The method uses a rank-sum statistic as the ranking criterion. Moreover, we propose using a significance level threshold, instead of the number of informative genes, as the parameter. Used as a parameter, the significance level threshold carries a statistical specification of quality. We follow Pitman efficiency theory to show that, in theory, the rank sum method is more accurate and more robust than the t-statistic method.

To verify the effectiveness of the rank sum method, we use support vector machines (SVMs) to construct classifiers based on the identified informative genes on two well-known data sets, namely the colon data and the leukemia data. The prediction accuracy reaches 96.2% on the colon data and 100% on the leukemia data. The results are clearly better than those from previous feature ranking methods. Our experiments also verify that using a significance level threshold is more effective than directly specifying an arbitrary k.
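
As a rough illustration of the selection step described in the abstract, the following Python sketch scores each gene with a two-sample rank sum test and keeps the genes whose p-value falls below a significance level threshold, rather than keeping a fixed top k. It is not the authors' implementation: the function names, the toy expression values, the two-sided normal approximation to the null distribution, and the omission of a tie correction are all assumptions made for this example.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided p-value of the rank sum test (normal approximation).

    Assumption for this sketch: no tie correction is applied to the variance.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first class
    mean = n1 * (n1 + n2 + 1) / 2.0          # E[W] under the null hypothesis
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_genes(expr, labels, alpha=0.01):
    """Return gene names with p-value < alpha, most significant first.

    `expr` maps a (hypothetical) gene name to its expression values, one per
    sample; `labels` holds the class (0 or 1) of each sample.
    """
    scored = []
    for gene, values in expr.items():
        x = [v for v, c in zip(values, labels) if c == 0]
        y = [v for v, c in zip(values, labels) if c == 1]
        p = rank_sum_p_value(x, y)
        if p < alpha:
            scored.append((p, gene))
    return [gene for p, gene in sorted(scored)]


# Toy data: 6 samples per class, two made-up genes.
labels = [0] * 6 + [1] * 6
expr = {
    "g_separated": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "g_interleaved": [1, 3, 5, 7, 9, 11, 2, 4, 6, 8, 10, 12],
}
selected = select_genes(expr, labels, alpha=0.01)
```

With this toy input, only the gene whose two classes are fully separated passes the threshold. The number of selected genes is an outcome of the chosen significance level, not a parameter set in advance.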