Revealing Protein Structures by Co-Occurrence Clustering of Aligned Pattern Clusters

  • Authors:
  • Sanderz Fung;En-Shiun Annie Lee;Andrew K.C. Wong

  • Affiliations:
  • Systems Design Engineering, University of Waterloo, Waterloo, Canada;Systems Design Engineering, University of Waterloo, Waterloo, Canada;Systems Design Engineering, University of Waterloo, Waterloo, Canada

  • Venue:
  • Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
  • Year:
  • 2013

Abstract

Proteins can be represented in several ways, including as a primary sequence, in which the protein is a string of amino acids, and as a three-dimensional structure, into which that sequence folds. Analyzing proteins from the same protein family reveals conserved regions that are common within the family, which yields biological knowledge. Because far more protein sequences are available than three-dimensional structures, it is crucial to be able to find characteristics of a protein's three-dimensional structure by analyzing its sequence. Through sequence pattern discovery and alignment, statistically significant sequence patterns in protein families are found and represented as Aligned Pattern Clusters (APCs). When two or more APCs frequently occur together on the same protein, this implies that together they have an important relationship in that protein. A co-occurrence score quantifies this relationship between APCs and is then used to cluster the APCs into APC clusters. The purpose of this paper is to examine the validity of the proposed method by applying it to two protein families, triosephosphate isomerase and G-alpha. The results are then verified against three-dimensional structures to examine whether they comply with the structure and how often they do so across different known structures. The results for both protein families comply with the known structures in the majority of cases, and their APCs were close in three-dimensional distance. We found three characteristics common to the resulting APC clusters from both sets of protein data: the APC cluster forms a complete graph, the APC cluster has a high co-occurrence score, and the APC cluster contains APCs with more than one pattern. Furthermore, our method and results are currently being verified on important proteins crystallized by an immunology lab.
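The co-occurrence clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the co-occurrence score or the clustering procedure, so everything concrete here is an assumption made for illustration: the score is taken to be the Jaccard overlap of the sets of proteins on which two APCs occur, clusters are taken to be connected components of the graph obtained by thresholding that score, and the function names, the threshold of 0.5, and the toy data are all hypothetical.

```python
from itertools import combinations


def cooccurrence_scores(apc_to_proteins):
    """Score each APC pair by Jaccard overlap of their protein sets.

    ASSUMPTION: Jaccard overlap is a stand-in; the paper's actual
    co-occurrence score is not given in the abstract.
    """
    scores = {}
    for a, b in combinations(sorted(apc_to_proteins), 2):
        pa, pb = apc_to_proteins[a], apc_to_proteins[b]
        union = pa | pb
        scores[(a, b)] = len(pa & pb) / len(union) if union else 0.0
    return scores


def cluster_apcs(apc_to_proteins, threshold=0.5):
    """Group APCs into connected components of the thresholded score graph.

    ASSUMPTION: thresholding plus connected components is an illustrative
    clustering rule, not necessarily the one used by the authors.
    """
    parent = {apc: apc for apc in apc_to_proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in cooccurrence_scores(apc_to_proteins).items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for apc in apc_to_proteins:
        clusters.setdefault(find(apc), []).append(apc)
    return sorted(sorted(members) for members in clusters.values())


def is_complete_graph(cluster, scores, threshold=0.5):
    """True if every pair of APCs in the cluster passes the threshold."""
    return all(
        scores[tuple(sorted(pair))] >= threshold
        for pair in combinations(cluster, 2)
    )


# Hypothetical toy data: which proteins each APC occurs on.
occurrences = {
    "APC1": {"p1", "p2", "p3", "p4"},
    "APC2": {"p1", "p2", "p3"},
    "APC3": {"p2", "p3", "p4"},
    "APC4": {"p5", "p6"},
}

scores = cooccurrence_scores(occurrences)
for cluster in cluster_apcs(occurrences):
    print(cluster, "complete graph:", is_complete_graph(cluster, scores))
```

In this toy data, APC1, APC2, and APC3 share most of their proteins and end up in one cluster in which every pair passes the threshold, which mirrors the complete-graph characteristic the abstract reports; APC4 shares no proteins with the others and remains alone.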