An effective, practical and low computational cost framework for the integration of heterogeneous data to predict functional associations between proteins by means of Artificial Neural Networks

  • Authors:
  • J. P. Florido; H. Pomares; I. Rojas; A. Guillén; F. M. Ortuno; J. M. Urquiza

  • Affiliations:
  • Department of Computer Architecture and Computer Technology, CITIC-UGR, University of Granada, 18071 Granada, Spain and Andalusian Human Genome Sequencing Centre (CASEGH), Medical Genome Project, ...; Department of Computer Architecture and Computer Technology, CITIC-UGR, University of Granada, 18071 Granada, Spain; Department of Computer Architecture and Computer Technology, CITIC-UGR, University of Granada, 18071 Granada, Spain; Department of Computer Architecture and Computer Technology, CITIC-UGR, University of Granada, 18071 Granada, Spain; Department of Computer Architecture and Computer Technology, CITIC-UGR, University of Granada, 18071 Granada, Spain; Chromatin and Disease Group, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet, 08907 Barcelona, Spain

  • Venue:
  • Neurocomputing
  • Year:
  • 2013

Abstract

Uncovering new functional relationships between proteins is currently one of the major goals of biological studies. For this task, integrating evidence from heterogeneous data sources by means of machine learning has been shown to be an effective way of building a complete genome-wide functional network and of inferring new functional associations more accurately. This work presents a new framework for Artificial Neural Networks (ANNs) that predicts functional relationships between proteins by integrating evidence from heterogeneous data sources. The methodology is motivated by a problem that arises when ANNs are applied to this kind of task, namely the computational cost of the ANN optimization process, which stems from the nature of the data (a large number of instances and high dimensionality). The method selects smaller, representative (non-random) subsets of the original data set for the ANN optimization process, which reduces the amount of training data and, consequently, the computational cost. Moreover, because the subsets are not only smaller but also representative of the original data set, the method (i) avoids repeating the optimization process several times with different random subsets of the data, a practice commonly used to obtain a reliable and fair evaluation of an ANN's prediction accuracy, and (ii) benefits the learning procedure by reducing overfitting, thereby improving predictive ability.
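
The abstract does not state how the representative subsets are chosen, so the following is only an illustrative sketch of the general idea of deterministic, non-random subset selection, not the authors' procedure. It substitutes a well-known stand-in technique, greedy farthest-point (k-center) sampling applied separately within each class, so that the reduced set spreads over the feature space and keeps the class proportions. The function names `farthest_point_subset` and `representative_subset`, the `fraction` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def farthest_point_subset(X, k):
    """Return indices of up to k rows of X chosen by greedy farthest-point sampling.

    Stand-in technique (assumption): the paper's abstract does not specify
    its own selection rule.
    """
    X = np.asarray(X, dtype=float)
    k = min(k, X.shape[0])
    # Start from the point closest to the mean so the result is deterministic.
    first = int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # dist[i] = distance from point i to its nearest already-chosen point.
    dist = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        if dist[nxt] == 0.0:
            # Only duplicates of already-chosen points remain.
            break
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)


def representative_subset(X, y, fraction):
    """Select a class-stratified, non-random subset; returns sorted row indices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Keep roughly the same fraction of every class, at least one sample.
        k = max(1, int(round(fraction * len(idx))))
        keep.extend(idx[farthest_point_subset(X[idx], k)])
    return np.sort(np.array(keep))


# Toy usage (synthetic data, not from the paper): keep about 10% of 200 samples.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)
subset_idx = representative_subset(X_toy, y_toy, fraction=0.1)
X_small, y_small = X_toy[subset_idx], y_toy[subset_idx]
```

Because the selection is deterministic, the same subset is returned on every call, which mirrors the abstract's point that a representative subset removes the need to repeat training over several random subsets. A network would then be trained on `X_small` and `y_small` instead of the full data.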