String retrieval for multi-pattern queries

Authors:
Wing-Kai Hon;Rahul Shah;Sharma V. Thankachan;Jeffrey Scott Vitter
Affiliations:
Department of CS, National Tsing Hua University, Taiwan;Department of CS, Louisiana State University;Department of CS, Louisiana State University;Department of EECS, The University of Kansas
Venue:
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Year:
2010

Citing 20

Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing.
The anatomy of a large-scale hypertextual Web search engine. WWW7 Proceedings of the seventh international conference on World Wide Web 7.
Efficient algorithms for document retrieval problems. SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms.
Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing.
High-order entropy-compressed text indexes. SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms.
The LCA Problem Revisited. LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics.
Augmenting Suffix Trees, with Applications. ESA '98 Proceedings of the 6th Annual European Symposium on Algorithms.
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing.
Succinct data structures for flexible text retrieval systems. Journal of Discrete Algorithms.
Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG).
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG).
Rank and select revisited and extended. Theoretical Computer Science.
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems.
Space-Efficient Algorithms for Document Retrieval. CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching.
Linear pattern matching algorithms. SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973).
The myriad virtues of Wavelet Trees. Information and Computation.
Efficient Data Structures for the Orthogonal Range Successor Problem. COCOON '09 Proceedings of the 15th Annual International Conference on Computing and Combinatorics.
Efficient Index for Retrieving Top-k Most Frequent Documents. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval.
Space-Efficient Framework for Top-k String Retrieval Problems. FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science.
Fast set intersection and two-patterns matching. LATIN'10 Proceedings of the 9th Latin American conference on Theoretical Informatics.

Cited 8

Inverted indexes for phrases and strings. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval.
Compressed indexes for aligned pattern matching. SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval.
Top-k document retrieval in optimal time and linear space. Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms.
New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science.
Forbidden patterns. LATIN'12 Proceedings of the 10th Latin American international conference on Theoretical Informatics.
Towards an optimal space-and-query-time index for top-k document retrieval. CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching.
Document listing for queries with excluded pattern. CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching.
Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences. ACM Computing Surveys (CSUR).

Abstract

Given a collection D of string documents {d_1, d_2, ..., d_|D|} of total length n, which may be preprocessed, a fundamental task is to retrieve the documents most relevant to a given query. The query consists of a set of m patterns {P_1, P_2, ..., P_m}. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all documents that match the query with a score above the threshold (or, respectively, the K documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been well solved by [14,9]. For two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took Õ(n^{3/2}) space. In this paper, we give the first linear-space (and partly succinct) data structures, which can answer multi-pattern queries in O(Σ|P_i|) + Õ(t^{1/m} n^{1-1/m}) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of O(|P_1| + |P_2| + √(nt) log^2 n). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
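
To make the query semantics in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the paper's data structure (no wavelet tree, no index, and every query scans all n characters); it only illustrates what the threshold and top-K variants ask for. The function names and the particular score (total occurrence count, counted only when every pattern occurs in the document) are assumptions chosen for illustration.

```python
from heapq import nlargest


def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def score(doc, patterns):
    """Hypothetical relevance score: total number of occurrences of all
    patterns, or 0 if any pattern is missing from the document."""
    counts = [count_occurrences(doc, p) for p in patterns]
    return sum(counts) if all(counts) else 0


def report_above_threshold(docs, patterns, threshold):
    """Indices of documents whose score exceeds the threshold."""
    return [i for i, d in enumerate(docs) if score(d, patterns) > threshold]


def report_top_k(docs, patterns, k):
    """Indices of the k highest-scoring documents with a positive score.
    Ties are broken in favour of the smaller document index."""
    scored = [(score(d, patterns), i) for i, d in enumerate(docs)]
    matching = [(s, i) for s, i in scored if s > 0]
    best = nlargest(k, matching, key=lambda pair: (pair[0], -pair[1]))
    return [i for _, i in best]


docs = ["abracadabra", "banana", "cabbage", "abab"]
patterns = ["ab", "a"]
print(report_above_threshold(docs, patterns, 3))
print(report_top_k(docs, patterns, 2))
```

With this toy collection, "banana" scores 0 because it never contains "ab", so it is excluded from both reports. An index of the kind the abstract describes aims to answer the same queries without rescanning the documents.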