Introduction to algorithms
On the primer selection problem in polymerase chain reaction experiments
Discrete Applied Mathematics - Special volume on computational molecular biology
Algorithms on strings, trees, and sequences: computer science and computational biology
Computers and Intractability: A Guide to the Theory of NP-Completeness
Tight approximability results for test set problems in bioinformatics
Journal of Computer and System Sciences
Integer linear programming approaches for non-unique probe selection
Discrete Applied Mathematics
Highly scalable algorithms for robust string barcoding
International Journal of Bioinformatics Research and Applications
Faster Algorithm for the Set Variant of the String Barcoding Problem
CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
Bayesian Optimization Algorithm for the Non-unique Oligonucleotide Probe Selection Problem
PRIB '09 Proceedings of the 4th IAPR International Conference on Pattern Recognition in Bioinformatics
Two challenges in genomics that can benefit from petascale platforms
Euro-Par'06 Proceedings of the CoreGRID 2006, UNICORE Summit 2006, Petascale Computational Biology and Bioinformatics conference on Parallel processing
Effective algorithms for fusion gene detection
WABI'10 Proceedings of the 10th international conference on Algorithms in bioinformatics
The string barcoding problem is NP-Hard
RCG'05 Proceedings of the 2005 international conference on Comparative Genomics
Highly scalable algorithms for robust string barcoding
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Asynchronous Teams for probe selection problems
Discrete Optimization
There are many critical situations in which one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. There is a limit to the number of cells on a chip, and a range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens.

Previous work addressed this problem [1], but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substring lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.
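To make the barcode design problem concrete, the following is a minimal Python sketch on toy sequences. It is not the method described above: it replaces the suffix tree with direct substring enumeration and replaces the ILP and CPLEX with exhaustive search, so it is only usable on tiny inputs. The function names (`presence_patterns`, `smallest_barcode`, `signature`) are hypothetical. Keeping one representative substring per distinct presence pattern is one natural way to shrink the candidate set; whether it matches the size reduction mentioned in the abstract is an assumption. The sketch only enforces that every pair of genomes gets a different signature; it does not model redundancy or the "not in the set" case.

```python
from itertools import combinations


def presence_patterns(genomes, min_len, max_len):
    """Map each distinct presence pattern to one representative substring.

    A pattern is a tuple of booleans, one per genome, saying whether the
    substring occurs in that genome.
    """
    patterns = {}
    for g in genomes:
        for length in range(min_len, max_len + 1):
            for i in range(len(g) - length + 1):
                s = g[i:i + length]
                pattern = tuple(s in h for h in genomes)
                # A substring present in every genome separates no pair.
                if all(pattern):
                    continue
                # Substrings with the same pattern are interchangeable,
                # so keep only the first one seen.
                patterns.setdefault(pattern, s)
    return patterns


def smallest_barcode(genomes, min_len, max_len):
    """Return a smallest list of substrings giving every genome a unique
    signature, or None if no such list exists. Exhaustive search stands in
    for the ILP here (an assumption made for illustration)."""
    pairs = list(combinations(range(len(genomes)), 2))
    if not pairs:
        return []  # zero or one genome: nothing to distinguish
    candidates = list(presence_patterns(genomes, min_len, max_len).items())
    for k in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, k):
            # Every pair must differ on at least one chosen substring.
            if all(any(p[a] != p[b] for p, _ in chosen) for a, b in pairs):
                return [s for _, s in chosen]
    return None


def signature(genome, barcode):
    """Tuple of presence bits of the barcode substrings in one genome."""
    return tuple(s in genome for s in barcode)


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TCGA", "TCGT"]  # made-up sequences
    barcode = smallest_barcode(toy, 1, 2)
    for g in toy:
        print(g, signature(g, barcode))
```

In an ILP formulation of the same toy problem, each candidate substring would become a 0/1 variable, each pair of genomes would contribute one constraint requiring at least one selected substring that separates the pair, and the objective would minimize the number of selected variables.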