Computing Highly Specific and Mismatch Tolerant Oligomers Efficiently

Authors:
Tomoyuki Yamada;Shinichi Morishita
Affiliations:
-;-
Venue:
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2003

Abstract

The sequencing of the genomes of a variety of species and the growing databases containing expressed sequence tags (ESTs) and complementary DNAs (cDNAs) facilitate the design of highly specific oligomers for use as genomic markers, PCR primers, or DNA oligo microarrays. The first step in evaluating the specificity of short oligomers of about twenty units in length is to determine the frequencies at which the oligomers occur. However, for oligomers longer than about fifty units this is not efficient, as they usually have a frequency of only 1. A more suitable procedure is to consider the mismatch tolerance of an oligomer, that is, the minimum number of mismatches that allows a given oligomer to match a sub-sequence other than the target sequence anywhere in the genome or the EST database. However, calculating the exact value of mismatch tolerance is computationally costly and impractical. Therefore, we studied the problem of checking whether an oligomer meets the constraint that its mismatch tolerance is no less than a given threshold. Here, we present an efficient dynamic programming algorithm that utilizes suffix and height arrays. We demonstrated the effectiveness of this algorithm by efficiently computing a dense list of oligo-markers applicable to the human genome. Experimental results show that the algorithm runs faster than Abrahamson's well-known algorithm by orders of magnitude and is able to enumerate 63% to 79% of qualified oligomers.
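
To make the definition of mismatch tolerance concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-array and height-array dynamic programming algorithm described in the abstract. It only illustrates the quantity being computed and the cheaper threshold check. The function names (`hamming_capped`, `mismatch_tolerance`, `meets_threshold`) are hypothetical, and the sketch assumes a single text, forward-strand comparison only, and that "a sub-sequence other than the target" means any window starting at a different position.

```python
def hamming_capped(a, b, cap):
    """Count mismatches between two equal-length strings.

    Counting stops early once the count reaches `cap`, so the result
    may be a lower bound on the true distance (but never exceeds it).
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches >= cap:
                break
    return mismatches


def mismatch_tolerance(text, start, length):
    """Minimum number of mismatches between the oligomer at `start`
    and any other window of the same length in `text` (naive scan)."""
    oligo = text[start:start + length]
    best = length + 1  # sentinel: returned only if no other window exists
    for pos in range(len(text) - length + 1):
        if pos == start:
            continue  # skip the target occurrence itself
        window = text[pos:pos + length]
        best = min(best, hamming_capped(oligo, window, best))
    return best


def meets_threshold(text, start, length, threshold):
    """Check whether the mismatch tolerance is no less than `threshold`
    without computing its exact value."""
    oligo = text[start:start + length]
    for pos in range(len(text) - length + 1):
        if pos == start:
            continue
        window = text[pos:pos + length]
        if hamming_capped(oligo, window, threshold) < threshold:
            return False  # some other window is too similar
    return True


if __name__ == "__main__":
    text = "ACGTACGAAC"
    # The oligomer "ACGT" at position 0 is one mismatch away from "ACGA".
    print(mismatch_tolerance(text, 0, 4))
    print(meets_threshold(text, 0, 4, 1), meets_threshold(text, 0, 4, 2))
```

The threshold check can stop at the first window that is too similar, which is why it is cheaper than computing the exact tolerance. Even so, this naive scan takes time proportional to the text length times the oligomer length for each oligomer, which is the kind of cost the abstract's algorithm is designed to avoid.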