Towards a DNA sequencing theory (learning a string)

  • Authors: M. Li
  • Affiliations: Waterloo Univ., Ont., Canada
  • Venue: SFCS '90 Proceedings of the 31st Annual Symposium on Foundations of Computer Science
  • Year: 1990

Abstract

Mathematical frameworks suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms are studied under plausible assumptions. The DNA sequencing problem is modeled as learning a superstring from its randomly drawn substrings. Under certain restrictions, this may be viewed as learning a superstring in L.G. Valiant's (1984) learning model, and in this case the author gives an efficient algorithm for learning a superstring, together with a quantitative bound on how many samples suffice. A major obstacle to the approach turns out to be a well-known open question: how to approximate the shortest common superstring of a set of strings. The author presents the first provably good algorithm for this problem: when the shortest superstring has length n, the algorithm produces a superstring of length O(n log n).
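To make the shortest common superstring problem concrete, the following is a minimal sketch of the classic greedy-merge heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is an illustration only and is not claimed to be the algorithm from the paper; the O(n log n) guarantee stated in the abstract is not asserted for this sketch. The function names `overlap` and `greedy_superstring` and the sample fragments are assumptions introduced here.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Build a common superstring by greedy pairwise merging (hypothetical helper)."""
    # Drop duplicates, then drop fragments already contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(strings) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, keeping the overlapping region only once.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""


if __name__ == "__main__":
    # Made-up fragments standing in for randomly drawn substrings of a sequence.
    reads = ["AGGT", "GTCA", "CATT"]
    print(greedy_superstring(reads))
```

Every input fragment is a substring of the returned string, and the result is never longer than the concatenation of the fragments. The sketch uses a quadratic pair search per merge for readability rather than speed.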