String barcoding: uncovering optimal virus signatures

  • Authors: Sam Rash; Dan Gusfield

  • Affiliations: University of California at Davis, Davis, CA (both authors)

  • Venue: Proceedings of the sixth annual international conference on Computational biology
  • Year: 2002

Abstract

There are many critical situations in which one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. There is a limit to the number of cells on a chip, and to the range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens.

Previous work addressed this problem [1], but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substring lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer-linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.
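To make the problem statement concrete, the following is a minimal sketch of the barcode design problem on toy strings. It is not the method of the paper: it swaps the suffix tree for naive substring enumeration and the ILP solved by CPLEX for an exhaustive search over subsets, so it is only usable on tiny inputs. The function names (`candidate_substrings`, `signature`, `smallest_barcode`) are hypothetical, and the sketch assumes that a valid barcode only needs pairwise-distinct signatures, without the redundancy against errors or mutations mentioned in the abstract.

```python
from itertools import combinations


def candidate_substrings(genomes, min_len, max_len):
    """Enumerate substrings in the allowed length range that can distinguish
    at least one pair of genomes (naive stand-in for a suffix tree)."""
    subs = set()
    for g in genomes:
        for k in range(min_len, max_len + 1):
            for i in range(len(g) - k + 1):
                subs.add(g[i:i + k])
    # Every enumerated substring occurs in at least one genome; one that
    # occurs in all of them cannot tell any two genomes apart.
    return sorted(s for s in subs if not all(s in g for g in genomes))


def signature(genome, probes):
    """Presence/absence report of each probe substring for one genome."""
    return tuple(p in genome for p in probes)


def smallest_barcode(genomes, min_len, max_len):
    """Return a minimum-size list of substrings giving every genome a
    distinct signature, or None if no such set exists.

    Exhaustive search, exponential in the number of candidates: an
    illustrative assumption of this sketch, not the ILP of the paper.
    """
    cands = candidate_substrings(genomes, min_len, max_len)
    # Shortcut of this sketch: substrings with the same occurrence pattern
    # across the genomes are interchangeable, so keep one per pattern.
    by_pattern = {}
    for s in cands:
        pattern = tuple(s in g for g in genomes)
        by_pattern.setdefault(pattern, s)
    reps = list(by_pattern.values())

    for size in range(len(reps) + 1):
        for probes in combinations(reps, size):
            sigs = {signature(g, probes) for g in genomes}
            if len(sigs) == len(genomes):
                return list(probes)
    return None
```

Calling `smallest_barcode(["ACGT", "ACGA", "TTGA"], 1, 2)` on these made-up strings returns two substrings, which is the minimum possible since a single presence/absence bit can separate at most two genomes. For realistic virus collections the subset search above is infeasible, which is where an ILP formulation of the kind the abstract describes becomes necessary.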