Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient algorithms for document retrieval problems
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Succinct Representation of Balanced Parentheses and Static Trees
SIAM Journal on Computing
High-order entropy-compressed text indexes
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
Augmenting Suffix Trees, with Applications
ESA '98 Proceedings of the 6th Annual European Symposium on Algorithms
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
SIAM Journal on Computing
Succinct data structures for flexible text retrieval systems
Journal of Discrete Algorithms
Compressed representations of sequences and full-text indexes
ACM Transactions on Algorithms (TALG)
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets
ACM Transactions on Algorithms (TALG)
Rank and select revisited and extended
Theoretical Computer Science
Compressed Suffix Trees with Full Functionality
Theory of Computing Systems
Space-Efficient Algorithms for Document Retrieval
CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
The myriad virtues of Wavelet Trees
Information and Computation
Efficient Data Structures for the Orthogonal Range Successor Problem
COCOON '09 Proceedings of the 15th Annual International Conference on Computing and Combinatorics
Efficient Index for Retrieving Top-k Most Frequent Documents
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Space-Efficient Framework for Top-k String Retrieval Problems
FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Fast set intersection and two-patterns matching
LATIN'10 Proceedings of the 9th Latin American conference on Theoretical Informatics
Inverted indexes for phrases and strings
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Compressed indexes for aligned pattern matching
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Top-k document retrieval in optimal time and linear space
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
New algorithms on wavelet trees and applications to information retrieval
Theoretical Computer Science
LATIN'12 Proceedings of the 10th Latin American international conference on Theoretical Informatics
Towards an optimal space-and-query-time index for top-k document retrieval
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Document listing for queries with excluded pattern
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences
ACM Computing Surveys (CSUR)
Given a collection $D$ of string documents $\{d_1, d_2, \ldots, d_{|D|}\}$ of total length $n$, which may be preprocessed, a fundamental task is to retrieve the documents most relevant to a given query. The query consists of a set of $m$ patterns $\{P_1, P_2, \ldots, P_m\}$. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter $K$), so that the task is to report all documents whose score for the query exceeds the threshold (respectively, the $K$ documents with the highest scores). When the documents are strings without word boundaries, traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been well solved by [14,9]. For two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], and it uses $\tilde{O}(n^{3/2})$ space. In this paper, we give the first linear-space (and partly succinct) data structures that answer multi-pattern queries in $O(\sum_i |P_i|) + \tilde{O}(t^{1/m} n^{1-1/m})$ time, where $t$ is the number of output occurrences. In the particular case of two patterns, we achieve the bound $O(|P_1| + |P_2| + \sqrt{nt}\,\log^2 n)$. We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
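To make the problem statement concrete, the following is a minimal brute-force sketch of the query semantics only, not the data structure proposed in the paper. It scans every document at query time, so it has none of the stated time bounds. The function names (`count_overlapping`, `score_documents`, `report_above`, `top_k`) are hypothetical, and two choices are assumptions made for illustration: the score is the total number of occurrences of all patterns, and a document qualifies only if every pattern occurs in it at least once.

```python
from typing import Dict, List


def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def score_documents(docs: List[str], patterns: List[str]) -> Dict[int, int]:
    """Map document index -> score, for documents containing every pattern.

    Assumption: score = total occurrence count over all patterns.
    """
    scores: Dict[int, int] = {}
    for doc_id, doc in enumerate(docs):
        counts = [count_overlapping(doc, p) for p in patterns]
        if all(c > 0 for c in counts):
            scores[doc_id] = sum(counts)
    return scores


def report_above(docs: List[str], patterns: List[str], threshold: int) -> List[int]:
    """Threshold variant: all documents whose score exceeds the threshold."""
    scores = score_documents(docs, patterns)
    return sorted(d for d, s in scores.items() if s > threshold)


def top_k(docs: List[str], patterns: List[str], k: int) -> List[int]:
    """Top-K variant: the k highest-scoring documents, ties broken by index."""
    scores = score_documents(docs, patterns)
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabbage", "abcab"]
    patterns = ["ab", "a"]
    print(score_documents(docs, patterns))
    print(report_above(docs, patterns, 3))
    print(top_k(docs, patterns, 2))
```

A baseline like this costs time proportional to the total collection length per query, which is the cost that an index-based approach aims to avoid.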