Hardness of optimal spaced seed design

Authors:
François Nicolas;Eric Rivals
Affiliations:
L.I.R.M.M. University of Montpellier II, CNRS U.M.R. 5506, Montpellier Cedex 5, France;L.I.R.M.M. University of Montpellier II, CNRS U.M.R. 5506, Montpellier Cedex 5, France
Venue:
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Year:
2005


Abstract

Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. The approaches that are fast in practice belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are excluded (filtered out) in a first step, and the remaining regions are compared to the pattern by dynamic programming in a second step. Among the necessary conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, it was shown recently that counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called gapped seeds, for the subwords, depending on the search parameters. The seed design problems proposed up to now differ in the way the similarities to detect are given: either a set of similarities is given in extenso (this is a “region specific” problem), or one wishes to detect all similar regions having at most k substitutions (general detection problem). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for both the region specific and general seed design problems, thereby justifying the exponential complexity of known algorithms. Moreover, we introduce a new formulation of the region specific seed design problem, in which the weight of the seed (i.e., the number of characters in the subwords) has to be maximized, and show that it is as difficult to approximate as Maximum Independent Set.
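
To make the notions of gapped seed, weight, and region specific design more concrete, here is a minimal Python sketch. It is not the authors' method, and every name and encoding in it is an assumption made for illustration: a seed is written as a string over `#` (position that must match) and `-` (don't-care position), a similarity is written as a string over `1` (match) and `0` (substitution), and the functions `seed_hits` and `max_weight_seed` are hypothetical. The search is a plain exhaustive enumeration, so its running time is exponential in the seed span.

```python
from itertools import combinations


def seed_hits(seed, similarity):
    """Return True if `seed` hits `similarity` at some offset.

    Assumed encoding (not taken from the abstract):
      seed:       '#' = position must be a match, '-' = don't care
      similarity: '1' = match, '0' = substitution
    """
    required = [i for i, c in enumerate(seed) if c == "#"]
    last_offset = len(similarity) - len(seed)
    for offset in range(last_offset + 1):
        # The seed hits here if every required position falls on a match.
        if all(similarity[offset + i] == "1" for i in required):
            return True
    return False


def max_weight_seed(similarities, span):
    """Exhaustively search for a heaviest seed of length `span` that hits
    every similarity in `similarities`. Returns None if no seed of
    weight >= 1 works. Exponential in `span`.
    """
    for weight in range(span, 0, -1):  # try heaviest seeds first
        for positions in combinations(range(span), weight):
            chosen = set(positions)
            seed = "".join("#" if i in chosen else "-" for i in range(span))
            if all(seed_hits(seed, s) for s in similarities):
                return seed
    return None


if __name__ == "__main__":
    region = "0110110"
    print(seed_hits("###", region))      # contiguous seed of weight 3
    print(seed_hits("##-#", region))     # spaced seed of weight 3
    print(max_weight_seed([region], 4))  # heaviest seed of span 4
```

On the toy similarity `0110110`, the contiguous seed `###` misses the region while the spaced seed `##-#` of the same weight hits it, which is the kind of gain in filtration efficiency the abstract refers to. The exhaustive loop in `max_weight_seed` only illustrates the weight-maximization formulation on tiny inputs.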