On the Approximation of Shortest Common Supersequencesand Longest Common Subsequences

  • Authors:
  • Tao Jiang;Ming Li

  • Affiliations:
  • -;-

  • Venue:
  • SIAM Journal on Computing
  • Year:
  • 1995

Quantified Score

Hi-index 0.01

Visualization

Abstract

The problems of finding shortest common supersequences (SCS) and longest common subsequences (LCS) are two well-known {\bf NP}-hard problems that have applications in many areas including computational molecular biology, data compression, robot motion planning and scheduling, text editing, etc. A lot of fruitless effort has been spent in searching for good approximation algorithms for these problems. In this paper, we show that these problems are inherently hard to approximate in the worst case. In particular, we prove that (i) SCS does not have a polynomial time linear approximation algorithm, unless {\bf P} = {\bf NP}; (ii) There exists a constant $\delta 0$ such that, if SCS has a polynomial time approximation algorithm with ratio $\log^{\delta} n$, where $n$ is the number of input sequences, then {\bf NP} is contained in {\bf DTIME}$(2^{\polylog n})$; (iii) There exists a constant $\delta 0$ such that, if LCS has a polynomial time approximation algorithm with performance ratio $n^{\delta}$, then {\bf P} = {\bf NP}. The proofs utilize the recent results of Arora et al. [Proc. 23rd IEEE Symposium on Foundations of Computer Science, 1992, pp. 14-23] on the complexity of approximation problems. In the second part of the paper, we introduce a new method for analyzing the average-case performance of algorithms for sequences, based on Kolmogorov complexity. Despite the above nonapproximability results, we show that near optimal solutions for both SCS and LCS can be found on the average. More precisely, consider a fixed alphabet $\Sigma$ and suppose that the input sequences are generated randomly according to the uniform probability distribution and are of the same length $n$. Moreover, assume that the number of input sequences is polynomial in $n$. Then, there are simple greedy algorithms which approximate SCS and LCS with expected additive errors $O(n^{0.707})$ and $O(n^{\frac{1}{2}+\epsilon})$ for any $\epsilon 0$, respectively. Incidentally, our analyses also provide tight upper and lower bounds on the expected LCS and SCS lengths for a set of random sequences, solving a generalization of another well-known open question on the expected LCS length for two random sequences [K. Alexander, The rate of convergence of the mean length of the longest common subsequence, 1992, manuscript],[V. Chvatal and D. Sankoff, J. Appl. Probab., 12 (1975), pp. 306-315], [D. Sankoff and J. Kruskall, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983].