Efficient selection of unique and popular oligos for large EST databases

Authors:
Jie Zheng; Timothy J. Close; Tao Jiang; Stefano Lonardi
Affiliations:
Dept. of Computer Science & Engineering, University of California, Riverside, CA; Department of Botany & Plant Sciences, University of California, Riverside, CA; Dept. of Computer Science & Engineering, University of California, Riverside, CA; Dept. of Computer Science & Engineering, University of California, Riverside, CA
Venue:
CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
Year:
2003

Abstract

EST databases have grown exponentially in recent years and now represent the largest collection of genetic sequences. An important application of these databases is the design of gene-specific oligonucleotides (or simply, oligos) that can be used in PCR primer design, microarray experiments, and genomic library screening. In this paper, we study two complementary problems concerning the selection of short oligos, e.g., 20-50 bases, from a large database of tens of thousands of EST sequences: (i) selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST sequence and (ii) selection of oligos that appear (exactly or approximately) in many ESTs. The first problem is called the unique oligo problem and has applications in PCR primer and microarray probe design. The second is called the popular oligo problem and is useful in screening genomic libraries (such as BAC libraries) for gene-rich regions. We present an efficient algorithm to identify all unique oligos in the ESTs and an efficient heuristic algorithm to enumerate the most popular oligos. By taking into account the distribution of the frequencies of the words in the EST database, the algorithms have been carefully engineered to achieve remarkable running times on regular PCs. Each of the algorithms takes only a couple of hours (on a machine with a 1.2 GHz CPU and 1 GB of RAM) to run on a dataset of 28 Mbases of barley ESTs from the HARVEST database. We present simulation results on synthetic data and a preliminary analysis of the barley EST database.
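
To make the two problem statements concrete, the following is a brute-force sketch in Python. It illustrates the definitions only and is not the engineered algorithms of the paper, which the abstract does not describe. Several choices here are assumptions: "approximately" is modelled as Hamming distance at most d, candidate popular oligos are restricted to k-mers that occur exactly in the data, and the names kmer_index, unique_oligos, and popular_oligos are hypothetical. The running time is quadratic in the number of distinct k-mers, so it is suitable for toy inputs only.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def kmer_index(ests, k):
    """Map each length-k substring to the set of EST indices containing it exactly."""
    index = defaultdict(set)
    for i, est in enumerate(ests):
        for p in range(len(est) - k + 1):
            index[est[p:p + k]].add(i)
    return index


def unique_oligos(ests, k, d):
    """Oligos that occur exactly in one EST and have no occurrence within
    Hamming distance d (an assumed mismatch model) in any other EST."""
    index = kmer_index(ests, k)
    result = []
    for oligo, owners in index.items():
        if len(owners) != 1:
            continue  # appears exactly in several ESTs, so not unique
        (owner,) = owners
        # Compare only against k-mers that occur in at least one other EST.
        clash = any(
            hamming(oligo, other) <= d
            for other, where in index.items()
            if where - {owner}
        )
        if not clash:
            result.append((oligo, owner))
    return sorted(result)


def popular_oligos(ests, k, d, top):
    """Rank k-mers by the number of ESTs that contain them exactly or
    within Hamming distance d, and return the `top` highest-ranked."""
    index = kmer_index(ests, k)
    scored = []
    for oligo in index:
        hits = set()
        for other, where in index.items():
            if hamming(oligo, other) <= d:
                hits |= where
        scored.append((len(hits), oligo))
    # Highest count first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top]
```

For example, calling unique_oligos(["ACGTAC", "ACGTTT", "GGGGCC"], 4, 1) rejects CGTA and CGTT, because each lies within one mismatch of the other in a different sequence, while popular_oligos on the same input ranks ACGT first, since it occurs in two of the three sequences.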