The shortest common superstring problem: average case analysis for both exact and approximate matching

  • Authors:
  • En-hui Yang;Zhen Zhang

  • Affiliations:
  • Dept. of Electr. & Comput. Eng., Waterloo Univ., Ont.;-

  • Venue:
  • IEEE Transactions on Information Theory
  • Year:
  • 2006

Abstract

The shortest common superstring problem and its extension to approximate matching are considered in a probability model in which every string in a given set has the same length and the letters of the strings are drawn independently from a finite set. In the exact matching case, several algorithms proposed in the literature are shown to be asymptotically optimal in the following sense. Define the savings of a superstring as the difference between the total length of the strings in the given set and the length of the superstring. Then the ratio of the savings achieved by each of these algorithms to the optimal savings, achieved by the shortest superstring, converges in probability to 1 as the number of strings in the given set increases. In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed; it is demonstrated that the optimal savings in this case is given approximately by n log n / I_l(Q, Q, 2D), where n is the number of strings in the given set, Q is the probability distribution governing the selection of letters of strings, I_l(Q, Q, 2D) is the lower mutual information between Q and Q with respect to 2D, and D ≥ 0 is the distortion allowed in approximate matching. In addition, an approximation algorithm is proposed and proved asymptotically optimal.
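
To make the notion of savings in the exact matching case concrete, the following is a minimal Python sketch. The abstract does not name the algorithms it analyzes, so the greedy merge heuristic used here (repeatedly merging the pair of strings with the largest suffix-prefix overlap) is an assumption chosen for illustration, not necessarily one of the paper's algorithms. The function names `overlap`, `greedy_superstring`, and `savings` are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy merge heuristic (an illustrative assumption, see lead-in)."""
    # Drop duplicates, then drop strings already contained in another string.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]
    while len(items) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, sharing the overlapping letters once.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)
    return items[0] if items else ""


def savings(strings, superstring: str) -> int:
    """Total length of the input strings minus the superstring length."""
    return sum(len(s) for s in strings) - len(superstring)


if __name__ == "__main__":
    example = ["abc", "bcd", "cde"]
    s = greedy_superstring(example)
    print(s, savings(example, s))  # abcde 4
```

In this small example the three strings have total length 9 and the superstring has length 5, so the savings is 4. The abstract's result concerns the ratio of such savings to the optimal savings as the number of strings grows; the sketch only illustrates the quantity being compared.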