Estimating pairwise statistical significance of protein local alignments using a clustering-classification approach based on amino acid composition

  • Authors:
  • Ankit Agrawal;Arka Ghosh;Xiaoqiu Huang

  • Affiliations:
  • Department of Computer Science, Iowa State University, Ames, IA;Department of Statistics, Iowa State University, Ames, IA;Department of Computer Science, Iowa State University, Ames, IA

  • Venue:
  • ISBRA'08 Proceedings of the 4th international conference on Bioinformatics research and applications
  • Year:
  • 2008

Abstract

A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. But no statistical theory is currently available for the gapped case and for alignments using multiple scoring matrices, although their score distribution is known to closely follow extreme value distribution and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including using multiple parameter sets. The resulting set of K and λ for different cluster pairs has large variability even for the same scoring scheme, underscoring the heavy dependence of K and λ on the amino acid composition. The proposed approach in this paper is an attempt to separate the influence of amino acid composition in estimation of statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimate statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.
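
To make the role of K and λ concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard extreme value approximation P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)) for sequences of lengths m and n, and it assumes that each sequence is assigned to the nearest composition centroid by Euclidean distance. The abstract does not specify the clustering method, so the function names, the distance measure, and the `params` lookup table are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_cluster(vec, centroids):
    """Index of the closest centroid (Euclidean distance is an assumption)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def pairwise_p_value(score, seq1, seq2, centroids, params):
    """Approximate P(S >= score) for one sequence pair.

    `centroids` is a list of composition vectors and `params` maps an
    ordered cluster pair (i, j) with i <= j to a pre-estimated (K, lambda).
    Both are hypothetical inputs, e.g. obtained offline by simulation.
    """
    i = nearest_cluster(composition(seq1), centroids)
    j = nearest_cluster(composition(seq2), centroids)
    K, lam = params[(min(i, j), max(i, j))]
    # Expected number of alignments scoring at least `score`.
    expected = K * len(seq1) * len(seq2) * math.exp(-lam * score)
    # 1 - exp(-expected), computed stably for small expected values.
    return -math.expm1(-expected)
```

In this sketch the expensive simulation is done once per cluster pair rather than once per sequence pair, which is the trade-off the abstract describes; any real centroids and (K, λ) values would have to come from such a simulation.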