Protein blocks versus hydrogen bonds based alphabets for protein structure classification

  • Authors:
  • Dino Franklin

  • Affiliations:
  • FACOM - Federal University of Uberlandia, Uberlândia, Brazil

  • Venue:
  • Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
  • Year:
  • 2012

Abstract

Fragment-based descriptions of protein structures have been used successfully in fast comparison methods for protein structure mining and classification. These descriptions reduce the dimensionality problem and allow sequence alignment techniques to be applied to the structural comparison of proteins. Most fragment-based alphabets were derived either from secondary-structure H-bond patterns or from clustering of local substructures. Both approaches have shown promising results in protein mining and classification, though their accuracy is still lower than that of accurate geometrical methods. In this paper, we describe two original H-bond based alphabets, HB_A and HB_B, obtained by clustering experimentally determined protein structures. We compare these two new H-bond based alphabets with two secondary-structure based alphabets (DSSP and STR) and two backbone-fragment based alphabets (KL and Q16). Information theory analysis showed that the information content is proportional to the size and coverage of the structural alphabets and that alphabets of the same class are more similar to one another. Amino acid sequences shared more information with the new H-bond based alphabets than with the other alphabets, though they presented the lowest mutual information with the sequences of secondary-structure based alphabets. The comparison of alignments obtained using the Smith-Waterman algorithm showed that similar classes have similar alignments and that, among alphabets of the same class, the most dissimilar alignments were those from HB_A and HB_B. H-bond based alphabets presented the best performance for protein classification using First-Nearest Neighbor and three different similarity measures: the scores of alignments obtained from the Smith-Waterman algorithm; the inner products of the normalized vectors from N-gram (N = 1, 2, 3 and 4) decomposition of the sequences; and the probabilities of belonging to the Hidden Markov Model of each training group of the dataset.
In addition, we showed that, using First-Nearest Neighbor and the Log-Likelihood Ratio index of Needleman-Wunsch algorithm scores, the H-bond based alphabets performed very close to DALI in structure-based protein classification.
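
As an illustration of the N-gram similarity measure mentioned in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes that the N-gram counts for N = 1 to 4 are pooled into a single vector and normalized to unit length, so that the inner product becomes a cosine similarity; the abstract does not say whether the four sizes are pooled or treated separately. The function names, the toy structural-alphabet strings and the class labels are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(seq, max_n=4):
    """Count all N-grams for N = 1..max_n and scale the counts to unit length.

    Pooling the four N-gram sizes into one vector is an assumption of this sketch.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


def classify_1nn(query, training):
    """Return (label, score) of the training sequence most similar to the query."""
    q = ngram_vector(query)
    best_label, best_score = None, -1.0
    for seq, label in training:
        score = inner_product(q, ngram_vector(seq))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical toy data: strings over an invented structural alphabet.
training = [
    ("HHHHHHHHTTHHHHHHHH", "all-alpha"),
    ("EEEEETTEEEEETTEEEE", "all-beta"),
]
label, score = classify_1nn("HHHHHHTTHHHHHH", training)
print(label, round(score, 3))
```

Because every vector has unit length, the score lies between 0 and 1, and First-Nearest Neighbor simply assigns the query the label of the highest-scoring training sequence. The same classifier loop could take Smith-Waterman scores or HMM probabilities in place of `inner_product`.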