Integrating machine learning techniques into robust data enrichment approach and its application to gene expression data

Authors:
Utku Erdoğdu;Mehmet Tan;Reda Alhajj;Faruk Polat;Jon Rokne;Douglas Demetrick
Affiliations:
Department of Computer Engineering, Middle East Technical University, Ankara 06800, Turkey;Department of Computer Engineering, TOBB University of Economics and Technology, Ankara 06560, Turkey;Department of Computer Science, University of Calgary, Calgary, Alberta T2N 1N4, Canada;Department of Computer Engineering, Middle East Technical University, Ankara 06800, Turkey;Department of Computer Science, University of Calgary, Calgary, Alberta T2N 1N4, Canada;Departments of Pathology, Oncology, Medical Genetics and Medical Biochemistry, University of Calgary, Calgary, Alberta T2N 4N1, Canada
Venue:
International Journal of Data Mining and Bioinformatics
Year:
2013

Abstract

The availability of sufficient samples for effective analysis and knowledge discovery has long been a challenge for the research community, especially in gene expression data analysis. As a result, most approaches developed for data analysis have suffered from a lack of data to train and test the constructed models. We argue that sample generation can be automated by employing sophisticated machine learning techniques, and that an automated sample generation framework can complement the collection of actual samples from real cases. This paper validates the argument by describing a framework that integrates multiple model perspectives for sample generation. We illustrate its applicability by producing new gene expression data samples, a highly demanding area that has not received attention. The three perspectives employed are based on models that are not closely related. This independence eliminates the bias that would arise if the approach covered only certain characteristics of the domain and produced samples skewed in one direction. The first model is based on the Probabilistic Boolean Network (PBN) representation of the gene regulatory network underlying the given gene expression data. The second model integrates a Hierarchical Markov Model (HIMM), and the third model employs a genetic algorithm. Each model learns as many characteristics of the domain being analysed as possible and tries to incorporate them when generating new samples. In other words, the models base their analysis on domain knowledge implicitly present in the data itself. The framework has been extensively tested by checking how well the new samples complement the original samples. The results are very promising, showing the effectiveness, usefulness and applicability of the proposed multi-model framework.
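
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind the first perspective: drawing new samples by simulating a Probabilistic Boolean Network. Everything in it is an assumption made for illustration: the three-gene network, the predictor functions and their selection probabilities are hand-written here rather than learned from expression data, expression values are taken to be binarized, and the names `PBN`, `step` and `generate_samples` are hypothetical.

```python
import random

# Hypothetical 3-gene PBN (illustrative only, not taken from the paper).
# Each gene maps to a list of (selection probability, Boolean predictor) pairs;
# the probabilities for a gene sum to 1.
PBN = {
    0: [(0.7, lambda s: s[1] and s[2]), (0.3, lambda s: s[1])],
    1: [(1.0, lambda s: not s[0])],
    2: [(0.6, lambda s: s[0] or s[1]), (0.4, lambda s: s[2])],
}


def step(state, pbn, rng):
    """Advance the network by one synchronous time step.

    For every gene, one predictor is drawn according to its selection
    probability and evaluated on the current state.
    """
    next_state = []
    for gene in sorted(pbn):
        weights = [p for p, _ in pbn[gene]]
        predictors = [f for _, f in pbn[gene]]
        chosen = rng.choices(predictors, weights=weights, k=1)[0]
        next_state.append(int(bool(chosen(state))))
    return tuple(next_state)


def generate_samples(pbn, n_samples, burn_in=10, seed=0):
    """Return n_samples binary expression profiles visited by the network."""
    rng = random.Random(seed)
    # Start from a random binary state and discard the first burn_in steps.
    state = tuple(rng.randint(0, 1) for _ in pbn)
    for _ in range(burn_in):
        state = step(state, pbn, rng)
    samples = []
    for _ in range(n_samples):
        state = step(state, pbn, rng)
        samples.append(state)
    return samples


if __name__ == "__main__":
    for sample in generate_samples(PBN, n_samples=5):
        print(sample)
```

In a setting like the one the abstract describes, the predictors and their probabilities would be inferred from the original gene expression data, and the generated profiles would then be checked against the original samples.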