Hardness of optimal spaced seed design

Authors:
François Nicolas;Eric Rivals
Affiliations:
LIRMM, UMR 5506, CNRS, Université de Montpellier II, 161, rue Ada, 34392 Montpellier Cedex 5, France;LIRMM, UMR 5506, CNRS, Université de Montpellier II, 161, rue Ada, 34392 Montpellier Cedex 5, France
Venue:
Journal of Computer and System Sciences
Year:
2008


Abstract

Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. The approaches that are fast in practice belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending on the search parameters. Two distinct lines of research appear in the literature: one with probabilistic formulations of seed design problems, in which one wishes, for instance, to compute a seed with the highest probability of detecting the desired similarities (lossy filtration), and a second with combinatorial formulations, where the goal is to find a seed that detects all or a maximum number of similarities (both lossless and lossy filtration). We concentrate on combinatorial seed design problems and consider formulations in which the set of sought similarities is either listed explicitly (RSOS) or characterised by their length and maximal number of mismatches (Non-Detection). Several articles exhibit exponential-time algorithms for these problems. In this work, we provide hardness and inapproximability results for several seed design problems, thereby justifying the complexity of known algorithms. Moreover, we introduce a new formulation of seed design (MWLS), in which the weight of the seed has to be maximised, and show that it is as difficult to approximate as Maximum Independent Set.
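To make the combinatorial setting concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes a common encoding that the abstract does not spell out: a seed is a string over '#' (position that must match) and '-' (don't-care position), and a similarity is a string over '1' (match) and '0' (mismatch). The function names `weight`, `detects`, and `is_lossless` are hypothetical and chosen only for illustration. The lossless check enumerates every placement of mismatches, so it takes exponential time in general.

```python
from itertools import combinations


def weight(seed: str) -> int:
    """Number of '#' (must-match) positions in the seed (assumed encoding)."""
    return seed.count("#")


def detects(seed: str, similarity: str) -> bool:
    """Return True if the seed hits the similarity at some offset.

    A hit at a given offset means every '#' position of the seed is
    aligned with a '1' (match) in the similarity.
    """
    span = len(seed)
    for start in range(len(similarity) - span + 1):
        if all(
            similarity[start + i] == "1"
            for i, symbol in enumerate(seed)
            if symbol == "#"
        ):
            return True
    return False


def is_lossless(seed: str, length: int, max_mismatches: int) -> bool:
    """Brute-force check that the seed detects every similarity of the
    given length with at most max_mismatches mismatches.

    Turning a mismatch into a match can only help detection, so it is
    enough to test similarities with exactly k mismatches.
    """
    k = min(max_mismatches, length)
    for mismatch_positions in combinations(range(length), k):
        similarity = ["1"] * length
        for position in mismatch_positions:
            similarity[position] = "0"
        if not detects(seed, "".join(similarity)):
            return False  # an undetected similarity exists
    return True


if __name__ == "__main__":
    print(weight("##-#"))
    print(detects("##-#", "1101"))
    print(is_lossless("#-#", 5, 1))
    print(is_lossless("#-#", 5, 2))
```

Under this encoding, a `False` result from `is_lossless` means that some similarity with the given length and number of mismatches escapes the seed, which is the kind of question the Non-Detection formulation asks about.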