Better Filtering with Gapped q-Grams

Authors:
Stefan Burkhardt;Juha Kärkkäinen
Venue:
CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Year:
2001

Abstract

The q-gram filter is a popular filtering method for approximate string matching. It compares substrings of length q (the q-grams) in the pattern and the text to identify the text areas that might contain a match. A generalization of the method is to use gapped q-grams, subsets of q characters in some fixed non-contiguous shape, instead of contiguous substrings. Although mentioned a few times in the literature, this generalization has never been studied in any depth. In this paper, we report the first results from a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. The performance, however, depends on the shape of the q-grams. The best shapes are rare and often possess no apparent regularity. We show how to recognize good shapes and demonstrate with experiments their advantage over both contiguous and average shapes. We concentrate here on the k mismatches problem, but also outline an approach for extending the results to the more common k differences problem.
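The idea in the abstract can be illustrated with a small sketch for the k mismatches setting. It is not the authors' implementation: the function names, the representation of a shape as a tuple of offsets, and the brute-force threshold search are all assumptions made for this illustration. A shape such as `(0, 1, 3)` picks the characters at those offsets from each starting position, and the threshold is the smallest number of q-grams that can survive any placement of k mismatches in a pattern of length m.

```python
from itertools import combinations


def gapped_qgrams(s, shape):
    """Return (start, q-gram) pairs for every placement of shape in s.

    shape is a sorted tuple of offsets starting at 0 (hypothetical
    representation): (0, 1, 2) is a contiguous 3-gram, (0, 1, 3) skips
    the third character.
    """
    span = shape[-1] + 1
    return [(i, "".join(s[i + d] for d in shape))
            for i in range(len(s) - span + 1)]


def matching_qgrams(pattern, window, shape):
    """Count aligned positions where both strings have the same q-gram."""
    pairs = zip(gapped_qgrams(pattern, shape), gapped_qgrams(window, shape))
    return sum(1 for (_, a), (_, b) in pairs if a == b)


def threshold(m, k, shape):
    """Brute-force the minimum number of q-grams left intact by any
    k mismatch positions in a pattern of length m (small inputs only)."""
    span = shape[-1] + 1
    starts = range(m - span + 1)
    best = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        intact = sum(1 for i in starts
                     if all(i + d not in bad for d in shape))
        best = min(best, intact)
    return best


if __name__ == "__main__":
    m, k = 11, 3
    print("contiguous 3-grams:", threshold(m, k, (0, 1, 2)))
    print("gapped shape (0, 1, 3):", threshold(m, k, (0, 1, 3)))
```

With m = 11 and k = 3, the contiguous shape has threshold 0, so it cannot rule out any text position, while the gapped shape `(0, 1, 3)` has threshold 1 and can therefore discard positions that share no q-gram with the pattern. The exhaustive search is only practical for small m and k; it is meant to show what the threshold means, not how to compute it efficiently.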