Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. The approaches that are fast in practice belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between a region and the pattern, many require a minimum number of common substrings. When only substitutions are taken into account for measuring dissimilarity, counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending on the search parameters. Two distinct lines of research appear in the literature. The first uses probabilistic formulations of seed design problems, in which one wishes, for instance, to compute a seed with the highest probability of detecting the desired similarities (lossy filtration). The second uses combinatorial formulations, where the goal is to find a seed that detects all or a maximum number of similarities (both lossless and lossy filtration). We concentrate on combinatorial seed design problems and consider formulations in which the set of sought similarities is either listed explicitly (RSOS) or characterised by their length and maximal number of mismatches (Non-Detection). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for several seed design problems, thereby justifying the complexity of known algorithms. Moreover, we introduce a new formulation of seed design (MWLS), in which the weight of the seed has to be maximised, and show that it is as difficult to approximate as Maximum Independent Set.
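
To make the combinatorial setting concrete, the following is a minimal sketch of what it means for a spaced seed to detect a similarity, and of a brute-force lossless check. The notation is an assumption for illustration and is not taken from the abstract: a seed is written over `#` (a position that must match) and `-` (a don't-care position), a similarity is a string over `1` (match) and `0` (mismatch), and the function names `hits`, `weight` and `is_lossless` are hypothetical. The exhaustive check is exponential in the number of mismatches and is only meant to illustrate the definitions, not any algorithm from the paper.

```python
from itertools import combinations


def weight(seed: str) -> int:
    """Number of positions of the seed that are required to match."""
    return seed.count("#")


def hits(seed: str, similarity: str) -> bool:
    """Return True if the seed detects the similarity.

    The seed detects the similarity if, at some offset, every '#'
    position of the seed is aligned with a match ('1').
    """
    span = len(seed)
    for offset in range(len(similarity) - span + 1):
        if all(similarity[offset + i] == "1"
               for i, c in enumerate(seed) if c == "#"):
            return True
    return False


def is_lossless(seed: str, m: int, k: int) -> bool:
    """Brute-force check that the seed detects every similarity of
    length m with at most k mismatches (assumed problem statement).

    Testing exactly min(k, m) mismatches suffices: turning a mismatch
    back into a match can never destroy a hit.
    """
    k = min(k, m)
    for mismatches in combinations(range(m), k):
        similarity = ["1"] * m
        for position in mismatches:
            similarity[position] = "0"
        if not hits(seed, "".join(similarity)):
            return False
    return True


if __name__ == "__main__":
    # Two seeds of weight 3 tested against the same similarity.
    print(hits("###", "1101"), hits("##-#", "1101"))
    print(is_lossless("##", 5, 1), is_lossless("###", 5, 1))
```

In this toy case the contiguous seed `###` misses the similarity `1101`, while the spaced seed `##-#` of the same weight detects it, which is the kind of gain the abstract attributes to spaced subwords under substitution-only dissimilarity.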