Exact computation of protein structure similarity

  • Authors: L. Paul Chew
  • Affiliations: Cornell University, Ithaca, NY
  • Venue: Proceedings of the twenty-second annual symposium on Computational geometry
  • Year: 2006

Abstract

A protein can be considered as a string (on the alphabet of 20 amino acids) or as a structure (each protein folds into a particular 3D configuration). Consider the following string-based problem: given two protein strings that are not necessarily similar in their entirety, determine the most similar contiguous substrings, one from each protein. The exact meaning of "most similar" here is determined by the user; it is based on user-specified scores for character vs. character similarity and for character vs. space similarity. It is important to allow for spaces or gaps because evolutionary changes to proteins often involve insertion or deletion of one or more individual amino acids. For this kind of string-based similarity, the most-similar substrings can be determined in time O(mn), where m and n are the lengths of the two strings, using Dynamic Programming (DP).
The goal here is to design an algorithm for similarity of protein structures as opposed to protein strings. The inspiration for our algorithm is drawn from the DP-based similarity algorithm for strings. Instead of comparing sequences of characters, we compare sequences of vectors. One complication of working with structures instead of strings is the problem of orientation: two structures that have similar shapes can "look different" if they are at different orientations. Algorithmically, this means that we must establish the optimal orientations for our two proteins as well as find the similar subsequences. In other words, an algorithm for similarity of structures involves both discrete optimization (to find the corresponding subsequences) and continuous optimization (to find the optimal orientation). Interestingly, if the correspondence is given then the optimal orientation (for that correspondence) is easy to find, and if the orientation is given then the optimal correspondence (for that orientation) is easy to find. The challenge is to accomplish both optimizations at once.
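The O(mn) string-based step described above can be illustrated with a minimal sketch of local alignment by dynamic programming (the recurrence commonly known as Smith-Waterman). The function name `local_alignment`, the example scoring function, and the gap value are assumptions chosen for illustration; the abstract only says that these scores are user-specified.

```python
def local_alignment(a, b, score, gap):
    """Return (best_score, substring_of_a, substring_of_b).

    score(x, y) is the user-specified character vs. character score;
    gap is the (negative) character vs. space score.
    """
    m, n = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1] and b[j-1],
    # or 0 if starting fresh is better.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two characters
                H[i - 1][j] + gap,                            # a[i-1] against a space
                H[i][j - 1] + gap,                            # b[j-1] against a space
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


def simple_score(x, y):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


result = local_alignment("WWACDEFYY", "KKACDEFLL", simple_score, gap=-2)
gapped = local_alignment("ACDEF", "ACEF", simple_score, gap=-2)
```

Both nested loops visit each of the (m+1)(n+1) table cells once, which is where the O(mn) bound comes from. In the second call the best alignment skips the `D` with a single gap.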
Note that the technique presented here produces a globally optimal solution; there are no approximations or assumptions of randomness.
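As a rough illustration of one half of the problem, the sketch below finds the best local correspondence between two sequences of 3D vectors at a fixed orientation, reusing the string-style DP. It is not the globally optimal algorithm of the paper, which must also optimize over orientations. The pairwise score (dot product minus an `offset`), the `gap` penalty, and the function names are assumptions made for this example; the abstract does not state the scoring function actually used.

```python
def dot(u, v):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(u, v))


def best_local_score(us, vs, gap=-1.0, offset=0.5):
    """Best local-alignment score of two vector sequences at a FIXED orientation.

    Assumed scoring (hypothetical): a pair of vectors scores dot(u, v) - offset,
    so well-aligned unit vectors score positively and poorly aligned ones
    score negatively. Only two DP rows are kept, since only the score is needed.
    """
    prev = [0.0] * (len(vs) + 1)
    best = 0.0
    for u in us:
        cur = [0.0]
        for j, v in enumerate(vs, start=1):
            s = dot(u, v) - offset
            cur.append(max(0.0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


x, y, z = (1, 0, 0), (0, 1, 0), (0, 0, 1)
same = best_local_score([x, y, z], [x, y, z])
shifted = best_local_score([x, y, z], [y, z, x])
# The same shape rotated 90 degrees about the z-axis: x -> y, y -> -x, z -> z.
rotated = best_local_score([x, y, z], [y, (-1, 0, 0), z])
```

The last call shows why orientation matters: the second sequence is the first one rotated, yet at this fixed orientation it scores lower than the unrotated copy.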