Gauss-integral based representation of protein structure for predicting the fold class from the sequence

  • Authors:
  • Bjørn G. Nielsen; Peter Røgen; Henrik G. Bohr

  • Affiliations:
  • Quantum Protein Centre (QuP), Department of Physics, Technical University of Denmark, Bldg. 309, DK-2800, Kongens Lyngby, Denmark; Department of Mathematics, Technical University of Denmark, Bldg. 303, DK-2800, Kongens Lyngby, Denmark; Quantum Protein Centre (QuP), Department of Physics, Technical University of Denmark, Bldg. 309, DK-2800, Kongens Lyngby, Denmark

  • Venue:
  • Mathematical and Computer Modelling: An International Journal
  • Year:
  • 2006

Abstract

A representative subset of protein chains was selected from the CATH 2.4 database [C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH - a hierarchic classification of protein domain structures, Structure 5 (8) (1997) 1093-1108] and used to train a feed-forward neural network to predict protein fold classes. The network takes the dipeptide frequency matrix as input and produces as output a novel representation of the protein chains in R^30 space, based on knot invariant values [P. Røgen, B. Fain, Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences of the United States of America 100 (1) (2003) 119-124; P. Røgen, H.G. Bohr, A new family of global protein shape descriptors, Mathematical Biosciences 182 (2) (2003) 167-181]. In the general case, when singletons (proteins that are the unique members of a topology or sequence homology set) are excluded, the success rates for the predictions were 77% at the class level, 60% for architecture, and 48% for topology. The total number of fold classes included in the present data set (~500) is ten times that reported in earlier attempts, so this result represents an improvement on previous work, which reported on a few handpicked folds. Furthermore, distance analysis of the network outputs resulting from singletons shows that novel topologies can be detected with very high confidence (~85%); in these cases the network can be used as a sorting mechanism that identifies sequences which might need special attention. Such distance analysis also yields a direct measure of prediction confidence.
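
To make the input and the distance analysis described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the dipeptide frequency matrix is a 20x20 table of adjacent residue pair counts normalized by the total number of pairs (the abstract does not state the normalization), and that distance analysis means the Euclidean distance from a network output to the nearest reference vector. The names `dipeptide_frequencies` and `nearest_reference` are hypothetical, and the 30 Gauss-integral descriptors themselves are not computed here.

```python
import math

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_frequencies(sequence):
    """Return a 20x20 matrix of relative frequencies of adjacent residue pairs.

    Pairs containing a non-standard residue are skipped (an assumption).
    """
    counts = [[0] * 20 for _ in range(20)]
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first]][INDEX[second]] += 1
            total += 1
    if total == 0:
        return [[0.0] * 20 for _ in range(20)]
    return [[c / total for c in row] for row in counts]


def nearest_reference(output, references):
    """Return (label, distance) of the reference vector closest to `output`.

    `references` maps a label (e.g. a fold identifier) to a vector of the
    same length as `output`; Euclidean distance is an assumed metric.
    """
    best_label, best_dist = None, math.inf
    for label, ref in references.items():
        dist = math.dist(output, ref)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```

Under these assumptions, flattening the 20x20 matrix gives a 400-element input vector for the network. A large distance returned by `nearest_reference` could flag a sequence as a possible novel topology, while a small one would indicate a more confident prediction; any threshold for that decision would have to be chosen from the data.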