The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

  • Authors:
  • Olivier Bastien;Philippe Ortet;Sylvaine Roy;Eric Maréchal

  • Affiliations:
  • UMR 5168 CNRS-CEA-INRA-Université Joseph Fourier, Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Ma ...;Département d'Ecophysiologie Végétale et de Microbiologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, France;Laboratoire Biologie, Informatique, Mathématiques, Département Réponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, France;UMR 5168 CNRS-CEA-INRA-Université Joseph Fourier, Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Ma ...

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2007

Abstract

Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will never be experimentally confirmed, nor will their functions be directly analyzed. Exploration of most of the protein space therefore relies on our ability to extrapolate knowledge from characterized sequences to uncharacterized ones. In this paper, we analyzed large-scale comparisons of hundreds of thousands of protein sequences that were previously carried out using supercomputers or grid frameworks. Following these comparisons, pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projecting sequences into the configuration space of homologous proteins (CSHP) could provide a theoretically robust, long-term practical solution for organizing the protein space. The CSHP can be constructed from the output of any all-by-all pairwise comparison in which Z-values were computed by Monte Carlo simulation. Reduction of protein diversity can then be carried out according to an evolutionary model that yields consistent phylogenetic clusters. The projection in the CSHP can easily be updated after sequence database updates, and the accuracy of the phylogenetic topology can be improved by refining sub-models. Clusters of homologous proteins can be represented as phylogenetic trees (TULIP trees). We showed that the CSHP projection can be used to process the outputs of previous massive comparison projects based on Z-value statistics, given minor corrections for uncollected low values, and we propose guidelines for future generations of massive protein sequence comparison projects.
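
The abstract refers to Z-values computed after Monte Carlo simulations. As a rough illustration only, the sketch below estimates a Z-value for one pair of sequences by comparing the observed alignment score with the scores obtained after shuffling one of the sequences. It is not the authors' pipeline: the function names `local_alignment_score` and `z_value`, the toy match/mismatch/gap scores, the number of shuffles, and the handling of a zero standard deviation are all assumptions made for this example. A real comparison would use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman recurrence, linear gaps).

    The scoring values are toy assumptions, not a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=200, seed=0):
    """Monte Carlo Z-value: (observed - mean of shuffled) / std of shuffled.

    Sequence b is shuffled, which preserves its length and composition.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    std = statistics.pstdev(shuffled_scores)
    if std == 0:
        # Degenerate case: no spread among shuffled scores.
        # Returning 0.0 is an arbitrary convention chosen for this sketch.
        return 0.0
    return (observed - mean) / std


if __name__ == "__main__":
    # Hypothetical sequences, used only to exercise the functions.
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    other = "GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLH"
    print("self comparison:", z_value(seq, seq))
    print("unrelated pair: ", z_value(seq, other))
```

Shuffling keeps the amino acid composition of the sequence fixed, so the Z-value measures how far the observed score stands out from what composition alone would produce. An all-by-all comparison would repeat this estimate for every pair, which is why such projects needed supercomputers or grid frameworks.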