Pairwise statistical significance versus database statistical significance for local alignment of protein sequences

  • Authors:
  • Ankit Agrawal;Volker Brendel;Xiaoqiu Huang

  • Affiliations:
  • Department of Computer Science, Iowa State University, Ames, IA;Department of Genetics, Development and Cell Biology and Department of Statistics, Iowa State University, Ames, IA;Department of Computer Science, Iowa State University, Ames, IA

  • Venue:
  • ISBRA'08 Proceedings of the 4th international conference on Bioinformatics research and applications
  • Year:
  • 2008

Abstract

An important aspect of pairwise sequence comparison is assessing the statistical significance of the alignment. Most of the currently popular alignment programs report the statistical significance of an alignment in the context of a database search. This database statistical significance depends on the database, and hence the same alignment of a pair of sequences may be assigned different statistical significance values in different databases. In this paper, we explore the use of pairwise statistical significance, which is independent of any database and can be useful in cases where we only have a pair of sequences and want to comment on the relatedness of the sequences, independent of any database. We compared different methods and determined that censored maximum likelihood fitting of the score distribution right of the peak is the most accurate method for estimating pairwise statistical significance. We evaluated this method in an experiment with a subset of CATH2.3, which had been previously used by other authors as a benchmark data set for protein comparison. Comparison of results with the database statistical significance reported by popular programs like SSEARCH and PSI-BLAST indicates that the results of pairwise statistical significance are comparable to, and indeed sometimes significantly better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.
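
The censored maximum likelihood fit mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the null scores are taken to follow a Gumbel (extreme value) distribution with parameters lam and mu; the peak is estimated as the most frequent rounded score; scores left of the peak enter the likelihood only through their count (type-I left censoring); and the null scores are simulated here rather than produced by real alignments. All function names and the search bracket for lam are hypothetical.

```python
import math
import random
from collections import Counter


def fit_gumbel_censored(scores, cutoff=None):
    """Type-I left-censored ML fit of a Gumbel distribution; returns (lam, mu).

    Scores below `cutoff` contribute only through their count; scores at or
    above it contribute their exact values.
    """
    if cutoff is None:
        # Assumption: use the most frequent rounded score as the peak.
        cutoff = float(Counter(round(s) for s in scores).most_common(1)[0][0])
    y = [s - cutoff for s in scores if s >= cutoff]  # shifted, uncensored part
    n_obs, n_cens = len(y), len(scores) - len(y)
    sum_y = sum(y)

    def tilted_sum(lam):
        # exp(-lam * y) over observed scores, plus one unit per censored score.
        return sum(math.exp(-lam * v) for v in y) + n_cens

    def profile_loglik(lam):
        # Log-likelihood with mu replaced by its closed-form optimum for lam.
        return (n_obs * math.log(lam) - lam * sum_y
                + n_obs * math.log(n_obs / tilted_sum(lam)) - n_obs)

    # Golden-section search for the maximum over lam (the profile is concave).
    lo, hi = 1e-3, 5.0  # assumed search bracket for lam
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(100):
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if profile_loglik(a) < profile_loglik(b):
            lo = a
        else:
            hi = b
    lam = (lo + hi) / 2.0
    mu = cutoff + math.log(n_obs / tilted_sum(lam)) / lam
    return lam, mu


def pairwise_pvalue(score, lam, mu):
    """P(S >= score) under the fitted Gumbel null."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def simulate_gumbel_scores(lam, mu, n, rng):
    """Stand-in for null alignment scores: n draws by inverse-CDF sampling."""
    return [mu - math.log(-math.log(max(rng.random(), 1e-300))) / lam
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    null_scores = simulate_gumbel_scores(0.3, 30.0, 2000, rng)
    lam, mu = fit_gumbel_censored(null_scores)
    print(f"lam={lam:.3f} mu={mu:.2f} p(60)={pairwise_pvalue(60.0, lam, mu):.2e}")
```

In practice the simulated scores would be replaced by scores from some null model for the sequence pair at hand. The fit uses only those scores and no database size, which is the sense in which the resulting p-value is independent of any database.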