String barcoding: uncovering optimal virus signatures

Authors:
Sam Rash; Dan Gusfield
Affiliations:
University of California at Davis, Davis, CA; University of California at Davis, Davis, CA
Venue:
Proceedings of the sixth annual international conference on Computational biology
Year:
2002

References:
Introduction to algorithms
On the primer selection problem in polymerase chain reaction experiments. Discrete Applied Mathematics - Special volume on computational molecular biology
Algorithms on strings, trees, and sequences: computer science and computational biology
Computers and Intractability: A Guide to the Theory of NP-Completeness

Cited by:
Tight approximability results for test set problems in bioinformatics. Journal of Computer and System Sciences
Integer linear programming approaches for non-unique probe selection. Discrete Applied Mathematics
Highly scalable algorithms for robust string barcoding. International Journal of Bioinformatics Research and Applications
Faster Algorithm for the Set Variant of the String Barcoding Problem. CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
Bayesian Optimization Algorithm for the Non-unique Oligonucleotide Probe Selection Problem. PRIB '09 Proceedings of the 4th IAPR International Conference on Pattern Recognition in Bioinformatics
Two challenges in genomics that can benefit from petascale platforms. Euro-Par'06 Proceedings of the CoreGRID 2006, UNICORE Summit 2006, Petascale Computational Biology and Bioinformatics conference on Parallel processing
Optimal decoding and minimal length for the non-unique oligonucleotide probe selection problem. Neurocomputing
Effective algorithms for fusion gene detection. WABI'10 Proceedings of the 10th international conference on Algorithms in bioinformatics
The string barcoding problem is NP-Hard. RCG'05 Proceedings of the 2005 international conference on Comparative Genomics
Highly scalable algorithms for robust string barcoding. ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Asynchronous Teams for probe selection problems. Discrete Optimization

Abstract

There are many critical situations in which one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. There is a limit to the number of cells on a chip, and a range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens.

Previous work addressed this problem [1], but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substring lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer-linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.
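
To make the selection problem concrete, here is a minimal Python sketch of barcode design on toy sequences. It is an illustration under stated assumptions, not the paper's implementation. Substrings are enumerated directly rather than with suffix trees, and an exhaustive search over subsets stands in for the ILP and CPLEX, so it is only practical for tiny inputs. The function names (`substring_presence`, `minimal_barcode`, `signature`) are hypothetical. Collapsing substrings that occur in exactly the same genomes is this sketch's own simplification and may or may not correspond to the size reduction mentioned in the abstract. The sketch only requires signatures to be pairwise distinct and does not handle error tolerance or unknown pathogens.

```python
from itertools import combinations


def substring_presence(genomes, min_len, max_len):
    """Map each substring of an allowed length to the set of genome indices containing it."""
    presence = {}
    for idx, genome in enumerate(genomes):
        for length in range(min_len, max_len + 1):
            for start in range(len(genome) - length + 1):
                presence.setdefault(genome[start:start + length], set()).add(idx)
    return presence


def minimal_barcode(genomes, min_len, max_len):
    """Return a smallest list of substrings giving every genome a distinct signature.

    Brute force over subsets (assumption: tiny inputs only); returns None if
    no set of allowed substrings separates every pair of genomes.
    """
    n = len(genomes)
    presence = substring_presence(genomes, min_len, max_len)

    # Keep one representative substring per distinct presence pattern.
    # A substring present in every genome cannot separate any pair, so skip it.
    representatives = {}
    for substring, occurrences in sorted(presence.items()):
        pattern = frozenset(occurrences)
        if len(pattern) < n:
            representatives.setdefault(pattern, substring)

    patterns = list(representatives)
    pairs = list(combinations(range(n), 2))

    # Try subset sizes in increasing order, so the first hit is minimum size.
    for size in range(len(patterns) + 1):
        for chosen in combinations(patterns, size):
            # Every pair must be split by some chosen substring:
            # present in exactly one of the two genomes.
            if all(any((i in p) != (j in p) for p in chosen) for i, j in pairs):
                return sorted(representatives[p] for p in chosen)
    return None


def signature(genome, barcode):
    """Presence/absence report of each barcode substring for one genome."""
    return tuple(substring in genome for substring in barcode)


if __name__ == "__main__":
    toy_genomes = ["AC", "AG", "TC", "TG"]  # made-up sequences for illustration
    barcode = minimal_barcode(toy_genomes, 1, 1)
    print(barcode)
    for g in toy_genomes:
        print(g, signature(g, barcode))
```

In ILP terms, each candidate substring would get a 0/1 variable, the objective would minimize the number of selected substrings, and each pair of genomes would contribute a constraint requiring at least one selected substring that occurs in exactly one of the two. The exhaustive loop above checks the same pairwise condition directly.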