Indexing a dictionary for subset matching queries

Authors:
Gad M. Landau;Dekel Tsur;Oren Weimann
Affiliations:
Department of Computer Science, University of Haifa, Haifa, Israel;Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel;Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel
Venue:
Algorithms and Applications
Year:
2010

Citing 18

Fast algorithms for finding nearest common ancestors

SIAM Journal on Computing
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
On the sorting-complexity of suffix tree construction

Journal of the ACM (JACM)
Verifying candidate matches in sparse and wildcard matching

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Deterministic dictionaries

Journal of Algorithms
Faster Algorithms for String Matching Problems: Matching the Convolution Bound

FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Efficient text fingerprinting via Parikh mapping

Journal of Discrete Algorithms
Dictionary matching and indexing with errors and don't cares

STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Novel Transformation Techniques Using Q-Heaps with Applications to Computational Geometry

SIAM Journal on Computing
Linear work suffix array construction

Journal of the ACM (JACM)
Character sets of strings

Journal of Discrete Algorithms
Linear pattern matching algorithms

SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Haplotype inference by pure Parsimony

CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
New algorithms for text fingerprinting

CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
Suffix trays and suffix trists: structures for faster text indexing

ICALP'06 Proceedings of the 33rd international conference on Automata, Languages and Programming - Volume Part I
A hidden Markov technique for haplotype reconstruction

WABI'05 Proceedings of the 5th International conference on Algorithms in Bioinformatics
Minimum multicolored subgraph problem in multiplex PCR primer set selection and population haplotyping

ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II
Haplotype inference via hierarchical genotype parsing

WABI'07 Proceedings of the 7th international conference on Algorithms in Bioinformatics

Abstract

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ={1,...,|Σ|}. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s∈D with |p|=|s| and p[i]∈s[i] for every 1≤i≤|p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s∈D there are at most k locations where |s[i]|>1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n log(min{n,m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n log(min{n,m})) preprocessing time and O(|p| log log|Σ| + min{|p|, log(|Σ|^k n)} log log(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [13,16]. There, D is a known library of genotypes (|Σ|=2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms.
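
To make the matching definition concrete, the following Python sketch checks it directly and also builds a naive index of all patterns that appear in D. It is an illustration only, not either of the constructions from the paper: it enumerates every pattern explicitly (up to |Σ|^k per string) and stores them in an uncompressed nested-dict trie. The names matches_naive, build_pattern_trie, and query are hypothetical, and dictionary strings are assumed to be lists of Python sets.

```python
from itertools import product

_END = object()  # sentinel key: a complete pattern ends at this trie node


def matches_naive(p, D):
    """Direct check of the definition: some s in D has |s| == |p|
    and p[i] in s[i] for every position i."""
    return any(
        len(s) == len(p) and all(c in chars for c, chars in zip(p, s))
        for s in D
    )


def build_pattern_trie(D):
    """Naive index (assumption: not the paper's construction).
    Enumerates every pattern matched by some s in D and inserts it
    into an uncompressed trie of nested dicts."""
    root = {}
    for s in D:
        # One choice of character per location gives one pattern.
        for pattern in product(*(sorted(chars) for chars in s)):
            node = root
            for c in pattern:
                node = node.setdefault(c, {})
            node[_END] = True
    return root


def query(root, p):
    """Walk the trie along p; one dict lookup per character of p."""
    node = root
    for c in p:
        node = node.get(c)
        if node is None:
            return False
    return _END in node


# Toy genotype library over the alphabet {1, 2} (|Σ| = 2).
D = [
    [{1}, {1, 2}, {2}],  # one ambiguous location, so k = 1
    [{2}, {2}],
]
trie = build_pattern_trie(D)
```

With this toy library, query(trie, (1, 2, 2)) and matches_naive((1, 2, 2), D) both report a match, while (2, 1, 2) is rejected by both. The enumeration makes the |Σ|^k n term in the preprocessing bounds visible: each dictionary string with k ambiguous locations contributes up to |Σ|^k patterns.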