The q-gram filter is a popular filtering method for approximate string matching. It compares substrings of length q (the q-grams) in the pattern and the text to identify the text areas that might contain a match. A generalization of the method uses gapped q-grams, that is, subsets of q characters in some fixed non-contiguous shape, instead of contiguous substrings. Although mentioned a few times in the literature, this generalization has never been studied in any depth. In this paper, we report the first results from a study of gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. The performance, however, depends on the shape of the q-grams: the best shapes are rare and often possess no apparent regularity. We show how to recognize good shapes and demonstrate experimentally their advantage over both contiguous and average shapes. We concentrate here on the k-mismatches problem, but also outline an approach for extending the results to the more common k-differences problem.
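To make the idea of a shape concrete, here is a minimal Python sketch of a gapped q-gram filter for the k-mismatches setting. It is an illustration under stated assumptions, not the paper's algorithm: a shape is represented as a tuple of offsets (so `(0, 1, 3)` stands for the shape `##-#`), the function names are hypothetical, and the threshold is found by brute force over all placements of k mismatches, which is only feasible for tiny inputs.

```python
from itertools import combinations


def span(shape):
    """Number of text positions covered by a shape given as offsets, e.g. (0, 1, 3)."""
    return max(shape) + 1


def gapped_qgram(s, start, shape):
    """Characters of s picked out by the shape when it is placed at position start."""
    return "".join(s[start + off] for off in shape)


def matching_qgrams(pattern, window, shape):
    """Count aligned positions where pattern and window share the same gapped q-gram.

    Assumes len(pattern) == len(window), as in the k-mismatches (Hamming) setting.
    """
    n = len(pattern) - span(shape) + 1
    return sum(
        gapped_qgram(pattern, i, shape) == gapped_qgram(window, i, shape)
        for i in range(n)
    )


def threshold(m, k, shape):
    """Brute-force filter threshold (illustrative only, exponential in k).

    Returns the minimum number of q-grams that stay intact over every way of
    placing k mismatches in a pattern of length m. Any window within k
    mismatches of the pattern must share at least this many q-grams.
    """
    n = m - span(shape) + 1
    best = n
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        intact = sum(all(i + off not in bad for off in shape) for i in range(n))
        best = min(best, intact)
    return best


contiguous = (0, 1, 2)   # shape ###
gapped = (0, 1, 3)       # shape ##-#

pattern = "ACGTACGTACG"
window = "ACTTAAGTACG"   # differs from pattern at positions 2 and 5

for shape in (contiguous, gapped):
    t = threshold(len(pattern), 2, shape)
    hits = matching_qgrams(pattern, window, shape)
    # A window passes the filter when it shares at least t q-grams with the pattern.
    print(shape, "threshold:", t, "shared q-grams:", hits, "passes:", hits >= t)
```

A window that shares fewer q-grams than the threshold cannot be within k mismatches and can be discarded without verification. This toy instance is far too small to show the advantage the abstract reports; it only shows how a shape changes which q-grams a mismatch destroys.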