String retrieval for multi-pattern queries

  • Authors:
  • Wing-Kai Hon;Rahul Shah;Sharma V. Thankachan;Jeffrey Scott Vitter

  • Affiliations:
  • Department of CS, National Tsing Hua University, Taiwan;Department of CS, Louisiana State University;Department of CS, Louisiana State University;Department of EECS, The University of Kansas

  • Venue:
  • SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given a collection D of string documents {d1, d2, ..., d|D|} of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P1, P2, ..., Pm}. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all the documents which match the query with score more than threshold (or respectively, the K documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single pattern retrieval case has been well-solved by [14,9]. When it comes to two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took Õ(n3/2) space. In this paper, we give the first linear space (and partly succinct) data structures, which can answer multi-pattern queries in O(Σ|Pi|) + Õ (t1/mn1-1/m) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of O(|P1|+|P2|+√nt log2 n). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.