Utilization of virtual samples to facilitate cancer identification for DNA microarray data in the early stages of an investigation

Authors:
Der-Chiang Li;Yao-Hwei Fang;Yung-Yao Lai;Susan C. Hu
Affiliations:
Department of Industrial and Information Management, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan;Division of Biostatistics and Bioinformatics, National Health Research Institutes, No. 35, Keyan Road, Zhunan, Miaoli County 350, Taiwan;Department of Industrial and Information Management, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan;Department of Public Health, College of Medicine, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
Venue:
Information Sciences: an International Journal
Year:
2009

Abstract

DNA microarray datasets are generally small, high dimensional with many non-discriminative genes, and non-linear with outliers. Because the number of samples is small relative to the number of dimensions, learning from such datasets is a small-sample problem. Researchers have recently developed various gene selection algorithms to discover the genes most relevant to a specific disease and thereby reduce computation. Most gene selection algorithms improve learning performance and efficiency, but they still suffer from the insufficient number of training samples in the datasets. Moreover, in the early stage of diagnosing a new disease, very limited data can be obtained, so the derived diagnostic model is usually unreliable for identifying the new disease. Consequently, diagnostic performance is not always robust, even with gene selection algorithms. To address the problem of a very limited training dataset with non-linear data or outliers, we propose Group Virtual Sample Generation (GVSG), a non-linear virtual sample generation algorithm. The method detects the characteristics of the very limited data, forms discrete groups for each discriminative gene, and systematically generates virtual samples for each of these groups to accelerate and stabilize the modeling process. The results show that the method significantly improves learning accuracy when only early-stage DNA microarray data are available.
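
The abstract does not give the details of GVSG, so the following is only a minimal illustrative sketch of the general idea of group-wise virtual sample generation for a single gene, not the authors' algorithm. It rests on two assumptions made here purely for illustration: groups are formed by splitting the sorted expression values at unusually large gaps (the hypothetical `gap_factor` rule), and virtual values are drawn uniformly within each group's observed range. The function names and parameters are likewise hypothetical.

```python
import random
from typing import List, Optional, Sequence


def split_into_groups(values: Sequence[float], gap_factor: float = 2.0) -> List[List[float]]:
    """Split 1-D values into discrete groups.

    Assumption (not from the paper): a new group starts wherever the gap
    between sorted neighbours exceeds gap_factor times the mean gap.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        return [list(ordered)]
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    groups = [[ordered[0]]]
    for gap, value in zip(gaps, ordered[1:]):
        if mean_gap > 0 and gap > gap_factor * mean_gap:
            groups.append([value])      # large gap: open a new group
        else:
            groups[-1].append(value)    # otherwise extend the current group
    return groups


def generate_virtual_values(
    values: Sequence[float],
    n_virtual: int,
    gap_factor: float = 2.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw virtual expression values for one gene, group by group.

    Assumption (not from the paper): a group is picked with probability
    proportional to its size, then a value is drawn uniformly between
    that group's observed minimum and maximum.
    """
    if not values:
        raise ValueError("at least one observed value is required")
    rng = rng or random.Random(0)
    groups = split_into_groups(values, gap_factor)
    weights = [len(group) for group in groups]
    virtual = []
    for _ in range(n_virtual):
        group = rng.choices(groups, weights=weights)[0]
        virtual.append(rng.uniform(min(group), max(group)))
    return virtual


if __name__ == "__main__":
    # Hypothetical expression values of one gene with two visible clusters.
    observed = [1.0, 1.2, 1.1, 5.0, 5.3, 5.1]
    print(split_into_groups(observed))
    print(generate_virtual_values(observed, n_virtual=5))
```

Because values are drawn within groups rather than across the full observed range, no virtual value falls into the empty region between groups, which is one plausible reading of why a grouped, non-linear generator could suit data with outliers or multi-modal structure.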