A Screening Method for Z-Value Assessment Based on the Normalized Edit Distance

Authors:
Guillermo Peris; Andrés Marzal
Affiliations:
Universitat Jaume I (Castelló), Spain; Universitat Jaume I (Castelló), Spain
Venue:
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Year:
2009

Abstract

Pairwise global alignment scores are used to detect related sequences in genomes and proteins. These scores are biased by the length and composition of the compared sequences, so the Z-value is used to estimate their statistical significance. The Z-value is computed with a Monte Carlo algorithm that requires a large number of pairwise alignments between random permutations of the compared sequences. A different alignment score, the normalized edit distance, is independent of the sequence lengths, and its computation usually takes only 2 or 3 standard alignment calculations. In this paper we study the relationship between the normalized edit distance and the Z-value, and propose a method to screen out pairs of unrelated sequences, so that the Z-value needs to be computed for only a small percentage of sequence pairs. We apply this method to the comparison of proteins from Saccharomyces cerevisiae, Escherichia coli, Methanococcus jannaschii and Haemophilus influenzae, showing that the Z-value has to be computed for less than 1% of all protein pairs.