Space-Efficient Algorithms for Document Retrieval

  • Authors:
  • Niko Välimäki;Veli Mäkinen

  • Affiliations:
  • Department of Computer Science, University of Helsinki, Finland.;Department of Computer Science, University of Helsinki, Finland.

  • Venue:
  • CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study the Document Listing problem, where a collection D of documents d 1,...,d k of total length 驴 i d i = n is to be preprocessed, so that one can later efficiently list all the $\textrm{ndoc}$ documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with O(n) time preprocessing, one can answer the queries in $O(m+\textrm{ndoc})$ time. In this paper, we improve the space-requirement of the Muthukrishnan's solution from O(n logn) bits to |CSA| + 2n + nlogk (1 + o(1)) bits, where |CSA| ≤ n log|Σ|(1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal $O(m+\textrm{ndoc})$ time when . For general |Σ|,k the time requirement becomes $O(m \log |\Sigma|+\textrm{ndoc} \log k)$. Sadakane (ISAAC 2002) has developed a similar space-efficient variant of the Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.