Linear approximation of shortest superstrings

  • Authors:
  • Avrim Blum; Tao Jiang; Ming Li; John Tromp; Mihalis Yannakakis

  • Affiliations:
  • Massachusetts Institute of Technology, Cambridge; McMaster Univ., Hamilton, Ont., Canada; Univ. of Waterloo, Waterloo, Ont., Canada; CWI, Amsterdam, The Netherlands; AT&T Bell Labs, Murray Hill, NJ

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 1994

Quantified Score

Hi-index 0.01

Abstract

We consider the following problem: given a collection of strings s1, …, sm, find the shortest string s such that each si appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result.

We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
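The greedy procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `overlap` and `greedy_superstring` are hypothetical, ties between equally good pairs are broken arbitrarily, and the preprocessing step that discards duplicates and strings contained in other strings is an assumption added here so that every merge is well defined. The pair search is deliberately naive (quadratic in the number of strings per merge) and is meant for reading, not for large inputs. The modified 3n variant is not shown.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy merge: repeatedly join the pair with maximum overlap."""
    # Assumption (not stated in the abstract): drop duplicates and any
    # string that already occurs inside another input string.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        # Ordered pair (a, b) of distinct strings with the largest overlap;
        # ties are resolved by whichever pair max() encounters first.
        a, b = max(permutations(items, 2), key=lambda p: overlap(*p))
        k = overlap(a, b)
        items.remove(a)
        items.remove(b)
        # Merge by appending the non-overlapping tail of b to a.
        items.append(a + b[k:])

    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints "abcde"
```

On this small input the greedy result happens to be optimal (length 5); the bounds in the abstract concern how far the greedy output can exceed the optimum in the worst case.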