Predicting incomplete gene microarray data with the use of supervised learning algorithms

Authors:
Bhekisipho Twala; Motee Phorah
Affiliations:
Department of Electrical and Electronic Engineering Science, Faculty of Engineering and Built Environment, University of Johannesburg, P.O. Box 524, Auckland Park 2006, Johannesburg, South Africa
Modelling and Digital Sciences, Council of Scientific and Industrial Research (CSIR), Digital Intelligence Research Group, P.O. Box 395, Pretoria 0001, South Africa
Venue:
Pattern Recognition Letters
Year:
2010

Abstract

Motivation: With the wealth of sequence data and the huge volume of data generated by molecular technologies, gene classification and prediction have become a central challenge in microarray data analysis. This has led to the application of many well-established supervised learning (SL) algorithms in an attempt to provide more accurate, automatic prediction of diagnostic class (cancer/non-cancer). Virtually all research on SL addresses the task of learning to classify complete domain instances. In practice, however, instances often have to be classified from incomplete feature vectors, which can degrade the predictive accuracy of learned classifiers. Learning an accurate classifier from incomplete instances raises a number of new issues, some of which have not been properly addressed by bioinformatics research. An effective missing value estimation method is therefore required to improve predictive accuracy.

Results: The essence of the approach is the proposal that prediction using supervised learning can be improved in probabilistic terms given incomplete microarray data. The imputation approach is based on the a priori probability of each value, determined from those instances at a given node of a probabilistic decision tree (PDT) that have specified values. The proposed approach exploits the total probability theorem and Bayes' theorem, and it has three versions. We compare our approach with other supervised learning techniques, including C5.0, classification and regression trees (CART), k-nearest neighbour (k-NN), linear discrimination (LD), naive Bayes classifier (NBC), Repeated Incremental Pruning to Produce Error Reduction (RIPPER) and support vector machines (SVMs), in terms of their tolerance of incomplete test data. Eight cancer-related gene expression datasets are used for this task. Experimental results are provided to illustrate the efficiency and robustness of the proposed algorithm.
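
The abstract does not spell out the three versions of the algorithm, so the following is only a minimal sketch of the general idea it names: estimating the prior probability of each attribute value from the instances at a tree node that have a specified value, then combining the per-value class distributions with the total probability theorem. The function name `class_probs_with_missing`, the row layout and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def class_probs_with_missing(node_rows):
    """Estimate P(class) for a test instance whose split attribute is missing.

    node_rows: list of (attribute_value_or_None, class_label) pairs for the
    training instances that reached one tree node (hypothetical layout).
    Only rows with a specified (non-None) value are used, mirroring the
    abstract's "instances at that node that have specified values".
    """
    known = [(v, c) for v, c in node_rows if v is not None]
    if not known:
        return {}  # nothing to estimate from at this node

    value_counts = Counter(v for v, _ in known)
    class_by_value = defaultdict(Counter)
    for v, c in known:
        class_by_value[v][c] += 1

    total = len(known)
    probs = defaultdict(float)
    for v, n_v in value_counts.items():
        p_v = n_v / total  # prior probability of this attribute value
        for c, n_vc in class_by_value[v].items():
            # Total probability: P(c) = sum over v of P(v) * P(c | v)
            probs[c] += p_v * (n_vc / n_v)
    return dict(probs)


# Toy node: expression level of one gene (discretised) and the diagnosis.
rows = [
    ("high", "cancer"),
    ("high", "cancer"),
    ("high", "normal"),
    ("low", "normal"),
    (None, "cancer"),  # unspecified value, ignored when estimating priors
]
probs = class_probs_with_missing(rows)
```

In a full tree the per-value class distributions would come from the subtrees below each branch rather than from raw counts at the node, so this sketch only shows how a single missing test value can be marginalised out instead of being filled in with one guessed value.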