Indexing a dictionary for subset matching queries

Authors:
Gad M. Landau;Dekel Tsur;Oren Weimann
Affiliations:
Department of Computer Science, University of Haifa, Haifa, Israel and Department of Computer and Information Science, Polytechnic University, New York;Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel;Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA
Venue:
SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Year:
2007

Citing 15

Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing
Parallel Suffix-Prefix-Matching Algorithm and Applications. SIAM Journal on Computing
Verifying candidate matches in sparse and wildcard matching. STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Deterministic dictionaries. Journal of Algorithms
Faster Algorithms for String Matching Problems: Matching the Convolution Bound. FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Rapid identification of repeated patterns in strings, trees and arrays. STOC '72 Proceedings of the fourth annual ACM symposium on Theory of computing
Efficient text fingerprinting via Parikh mapping. Journal of Discrete Algorithms
Dictionary matching and indexing with errors and don't cares. STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Novel Transformation Techniques Using Q-Heaps with Applications to Computational Geometry. SIAM Journal on Computing
Character sets of strings. Journal of Discrete Algorithms
Haplotype inference by pure Parsimony. CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
New algorithms for text fingerprinting. CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
Suffix trays and suffix trists: structures for faster text indexing. ICALP'06 Proceedings of the 33rd international conference on Automata, Languages and Programming - Volume Part I
Phasing and missing data recovery in family trios. ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Minimum multicolored subgraph problem in multiplex PCR primer set selection and population haplotyping. ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II

Cited 1

On building minimal automaton for subset matching queries. Information Processing Letters

Abstract

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s ∈ D with |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n lg(min{n, m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n lg(min{n, m})) preprocessing time and O(|p| lg lg |Σ| + min{|p|, lg(|Σ|^k n)} lg lg(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [14,17]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms. In particular, this applies to algorithms based on the "pure parsimony criteria" [13,16], greedy heuristics such as "Clark's rule" [6,18], EM-based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of trios [4,27].
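
The matching definition and the idea of indexing every pattern that appears in D can be illustrated with a small Python sketch. It is not either of the constructions from the paper: it enumerates all expansions of each dictionary string naively and stores them in a plain, uncompressed trie, so its preprocessing cost does not meet the bounds stated above. The names build_trie, matches, matches_naive, and END are hypothetical, chosen only for this illustration.

```python
from itertools import product

END = None  # sentinel key marking the end of a stored pattern (characters are str)

def build_trie(dictionary):
    """Insert every pattern matched by some dictionary string into a nested-dict trie.

    Each dictionary string is a list of character sets, one set per location.
    A string with k locations of size > 1 expands to at most |Sigma|^k patterns.
    """
    root = {}
    for s in dictionary:
        # Enumerate all ways of picking one character per location.
        for pattern in product(*(sorted(location) for location in s)):
            node = root
            for ch in pattern:
                node = node.setdefault(ch, {})
            node[END] = True
    return root

def matches(trie, p):
    """Answer a query by walking the trie once, one step per character of p."""
    node = trie
    for ch in p:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

def matches_naive(dictionary, p):
    """Direct check of the definition: |p| = |s| and p[i] in s[i] for all i."""
    return any(
        len(s) == len(p) and all(c in location for c, location in zip(p, s))
        for s in dictionary
    )

# Toy genotype library over the alphabet {"0", "1"}.
D = [
    [{"0"}, {"0", "1"}, {"1"}],
    [{"1"}, {"1"}, {"0", "1"}],
]
trie = build_trie(D)
```

With this toy library, matches(trie, "011") holds because the first string allows either character at its second location, while matches(trie, "010") and matches(trie, "11") do not hold: the former fails at the last location and the latter has the wrong length.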