Indexing a dictionary for subset matching queries

  • Authors:
  • Gad M. Landau; Dekel Tsur; Oren Weimann

  • Affiliations:
  • Department of Computer Science, University of Haifa, Haifa, Israel; Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel; Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel

  • Venue:
  • Algorithms and Applications
  • Year:
  • 2010

Abstract

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ={1,...,|Σ|}. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s∈D such that |p|=|s| and p[i]∈s[i] for every 1≤i≤|p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s∈D there are at most k locations where |s[i]|>1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n log(min{n,m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n log(min{n,m})) preprocessing time and O(|p| log log |Σ| + min{|p|, log(|Σ|^k n)} log log(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [13,16]. There, D is a known library of genotypes (|Σ|=2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms.
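The matching definition in the abstract can be made concrete with a small sketch. The Python below is a naive illustration only, not either of the paper's constructions: it enumerates every pattern that matches a dictionary string (at most |Σ|^k patterns per string) and inserts each one into an ordinary, uncompressed trie, so its preprocessing cost is higher than the bounds stated above. The names `matches`, `build_trie`, `query`, and `END` are hypothetical, and a dictionary string is assumed to be represented as a list of Python sets.

```python
from itertools import product

END = object()  # hypothetical sentinel key marking the end of a stored pattern


def matches(p, D):
    """Check the definition directly: some s in D has |s| == |p| and p[i] in s[i]."""
    return any(
        len(s) == len(p) and all(c in pos for c, pos in zip(p, s))
        for s in D
    )


def build_trie(D):
    """Naive, uncompressed trie of every pattern that matches some string in D.

    Assumption: this brute-force enumeration stands in for the paper's
    compressed trie; it does not achieve the stated preprocessing time.
    """
    root = {}
    for s in D:
        # Expand s into all single-character patterns it can match.
        for pattern in product(*(sorted(pos) for pos in s)):
            node = root
            for c in pattern:
                node = node.setdefault(c, {})
            node[END] = True
    return root


def query(trie, p):
    """Walk the trie once per character of p, so O(|p|) dictionary lookups."""
    node = trie
    for c in p:
        node = node.get(c)
        if node is None:
            return False
    return END in node


# Example over the alphabet {1, 2}: the first string has one location with
# more than one character, so k = 1 for this dictionary.
D = [[{1}, {1, 2}, {2}], [{2}, {2}]]
trie = build_trie(D)
```

With this dictionary, `query(trie, [1, 2, 2])` and `query(trie, [2, 2])` both succeed, while `query(trie, [1, 1])` fails because no dictionary string has the same length and admits that pattern. The trie answers agree with the direct check in `matches` on these inputs.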