Estimating pairwise statistical significance of protein local alignments using a clustering-classification approach based on amino acid composition

Authors:
Ankit Agrawal;Arka Ghosh;Xiaoqiu Huang
Affiliations:
Department of Computer Science, Iowa State University, Ames, IA;Department of Statistics, Iowa State University, Ames, IA;Department of Computer Science, Iowa State University, Ames, IA
Venue:
ISBRA'08 Proceedings of the 4th international conference on Bioinformatics research and applications
Year:
2008

Citing 5

A time-efficient, linear-space local similarity algorithm
Advances in Applied Mathematics

Rapid significance estimation in local sequence alignment with gaps
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology

Rapid Assessment of Extremal Statistics for Gapped Local Alignment
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology

Convergent Island Statistics: a fast method for determining local alignment score significance
Bioinformatics

Pairwise statistical significance versus database statistical significance for local alignment of protein sequences
ISBRA'08 Proceedings of the 4th international conference on Bioinformatics research and applications

Cited 2

Pairwise statistical significance versus database statistical significance for local alignment of protein sequences
ISBRA'08 Proceedings of the 4th international conference on Bioinformatics research and applications

FPGA architecture for pairwise statistical significance estimation
International Journal of High Performance Systems Architecture

Abstract

A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. But no statistical theory is currently available for the gapped case and for alignments using multiple scoring matrices, although their score distribution is known to closely follow extreme value distribution and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including using multiple parameter sets. The resulting set of K and λ for different cluster pairs has large variability even for the same scoring scheme, underscoring the heavy dependence of K and λ on the amino acid composition. The proposed approach in this paper is an attempt to separate the influence of amino acid composition in estimation of statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimate statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.
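
As a rough illustration of the idea in the abstract, the following sketch classifies each sequence by its amino acid composition, looks up K and λ for the resulting cluster pair, and evaluates the standard extreme value tail P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)). The abstract does not specify the clustering algorithm, the distance measure, or any parameter values, so everything here is an assumption made for illustration: nearest-centroid assignment with Euclidean distance, the two centroids in `CENTROIDS`, the numbers in `PARAMS`, and all function names are hypothetical. Raw sequence lengths are used for m and n, with no edge-effect correction.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of seq."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

def nearest_cluster(vec, centroids):
    """Assign a composition vector to its closest centroid.

    Euclidean distance is an assumption; the source does not name a metric.
    """
    return min(range(len(centroids)),
               key=lambda i: math.dist(vec, centroids[i]))

def pairwise_p_value(score, m, n, K, lam):
    """Extreme value tail: P(S >= score) ~ 1 - exp(-K*m*n*exp(-lam*score))."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-K * m * n * math.exp(-lam * score))

def estimate_p_value(seq1, seq2, score, centroids, params):
    """Classify both sequences, look up (K, lambda), and return the P-value."""
    c1 = nearest_cluster(composition(seq1), centroids)
    c2 = nearest_cluster(composition(seq2), centroids)
    # The table is keyed by the unordered cluster pair.
    K, lam = params[tuple(sorted((c1, c2)))]
    return pairwise_p_value(score, len(seq1), len(seq2), K, lam)

# Hypothetical centroids: a uniform composition and an alanine-rich one.
CENTROIDS = [
    [1 / 20] * 20,
    [0.24] + [0.04] * 19,
]

# Placeholder (K, lambda) per cluster pair. In the approach described above
# these would be estimated once per cluster pair by simulation; the numbers
# below are made up and carry no biological meaning.
PARAMS = {
    (0, 0): (0.04, 0.27),
    (0, 1): (0.05, 0.28),
    (1, 1): (0.06, 0.30),
}

if __name__ == "__main__":
    a = "ACDEFGHIKLMNPQRSTVWY" * 5
    b = "AAAAAAAAAAGA" * 5
    print(estimate_p_value(a, b, 60, CENTROIDS, PARAMS))
```

The point of the lookup table is that the expensive simulation runs once per cluster pair rather than once per sequence pair; at query time only the classification and a closed-form evaluation remain.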