Space-Efficient Algorithms for Document Retrieval

Authors:
Niko Välimäki;Veli Mäkinen
Affiliations:
Department of Computer Science, University of Helsinki, Finland.;Department of Computer Science, University of Helsinki, Finland.
Venue:
CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Year:
2007

Abstract

We study the Document Listing problem, where a collection $D$ of documents $d_1, \ldots, d_k$ of total length $\sum_i |d_i| = n$ is to be preprocessed so that one can later efficiently list all the $\textrm{ndoc}$ documents containing a given query pattern $P$ of length $m$ as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with $O(n)$ time preprocessing, one can answer the queries in $O(m+\textrm{ndoc})$ time. In this paper, we improve the space requirement of Muthukrishnan's solution from $O(n \log n)$ bits to $|CSA| + 2n + n \log k\,(1 + o(1))$ bits, where $|CSA| \le n \log |\Sigma|\,(1 + o(1))$ is the size of any suitable compressed suffix array (CSA), and $\Sigma$ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain, e.g., the optimal $O(m+\textrm{ndoc})$ time when $|\Sigma|, k = O(\textrm{polylog}(n))$. For general $|\Sigma|, k$ the time requirement becomes $O(m \log |\Sigma|+\textrm{ndoc} \log k)$. Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
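
To make the listing step concrete, the following is a minimal Python sketch of the Muthukrishnan-style approach that the abstract takes as its starting point. It is not the space-efficient structure of the paper: it assumes a plain (uncompressed) suffix array built naively, an explicit document array and previous-occurrence array, and a linear scan standing in for a constant-time range-minimum query. The function names (`build_index`, `suffix_range`, `list_documents`) and the `"\x00"` separator are illustrative assumptions, not taken from the paper.

```python
def build_index(docs):
    """Build a toy, uncompressed index over a list of strings."""
    sep = "\x00"  # assumed not to occur in any document or pattern
    text = "".join(d + sep for d in docs)
    # doc_of[p] = index of the document that covers text position p.
    doc_of = []
    for i, d in enumerate(docs):
        doc_of.extend([i] * (len(d) + 1))
    # Naive suffix array; a real index would use a compressed suffix array.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # Document array: the document in which each sorted suffix starts.
    da = [doc_of[p] for p in sa]
    # prev[i] = largest j < i with da[j] == da[i], or -1 if there is none.
    last, prev = {}, []
    for i, d in enumerate(da):
        prev.append(last.get(d, -1))
        last[d] = i
    return text, sa, da, prev


def suffix_range(text, sa, pattern):
    """Return the half-open suffix-array range of suffixes prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def list_documents(docs, pattern):
    """List the indices of documents containing pattern, each exactly once."""
    text, sa, da, prev = build_index(docs)
    start, end = suffix_range(text, sa, pattern)
    out, stack = [], [(start, end - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Linear scan used here in place of a constant-time RMQ structure.
        i = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[i] >= start:
            continue  # every document in this subrange was already reported
        out.append(da[i])  # first occurrence of this document in the range
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return sorted(out)
```

For example, `list_documents(["abracadabra", "banana", "cabbage"], "ab")` reports documents 0 and 2. The key idea is that a position in the pattern's suffix range is the first occurrence of its document exactly when its previous-occurrence value falls before the start of the range, so repeated range-minimum queries visit each matching document once. In the setting described in the abstract, the explicit arrays in this sketch are the parts whose space is reduced.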