Space-efficient algorithms for document retrieval

  • Authors:
  • Niko Välimäki;Veli Mäkinen

  • Affiliations:
  • Department of Computer Science, University of Helsinki, Finland;Department of Computer Science, University of Helsinki, Finland

  • Venue:
  • CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
  • Year:
  • 2007

Abstract

We study the Document Listing problem, where a collection D of documents d_1, ..., d_k of total length Σ_i |d_i| = n is to be preprocessed so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bits to |CSA| + 2n + n log k (1 + o(1)) bits, where |CSA| ≤ n log |Σ| (1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain, e.g., the optimal O(m + ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ| and k, the time requirement becomes O(m log |Σ| + ndoc log k). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
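
To make the problem statement concrete, the following is a minimal Python sketch of document listing in the style usually attributed to Muthukrishnan's solution: a suffix array over the concatenated collection, a document array E, and an array C of previous same-document positions queried by range minimum. It is an illustration under stated assumptions, not the space-efficient structure of this paper. The suffix array is built naively and left uncompressed, the range-minimum query is a linear scan (so the O(m + ndoc) bound is not achieved), and all names (`build_index`, `suffix_range`, `list_documents`, `E`, `C`) as well as the `"\x00"` separator are hypothetical choices made for the example.

```python
def build_index(docs, sep="\x00"):
    """Build plain (uncompressed) arrays for document listing.

    Assumption: `sep` occurs in no document and in no query pattern.
    """
    text = "".join(d + sep for d in docs)
    # doc_of[p] = index of the document that text position p belongs to.
    doc_of = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    # Naive suffix array construction; fine for a sketch, not for real inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # E[i] = document containing the i-th smallest suffix.
    E = [doc_of[p] for p in sa]
    # C[i] = largest j < i with E[j] == E[i], or -1 if there is none.
    C, last = [], {}
    for i, d in enumerate(E):
        C.append(last.get(d, -1))
        last[d] = i
    return text, sa, E, C


def suffix_range(text, sa, pattern):
    """Return the inclusive suffix-array range [sp, ep] of suffixes
    that start with `pattern`; the range is empty when sp > ep."""
    m = len(pattern)

    def first(pred):
        # Binary search for the first suffix whose m-character prefix
        # satisfies the (monotone) predicate.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    sp = first(lambda s: s >= pattern)
    ep = first(lambda s: s > pattern) - 1
    return sp, ep


def list_documents(index, pattern):
    """List each document containing `pattern` exactly once."""
    text, sa, E, C = index
    sp, ep = suffix_range(text, sa, pattern)
    found = []
    stack = [(sp, ep)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue
        # Linear-scan range minimum over C[l..r]; a constant-time RMQ
        # structure would be needed for the optimal query time.
        pos = min(range(l, r + 1), key=C.__getitem__)
        if C[pos] >= sp:
            # Every document in C[l..r] already occurs earlier in [sp, ep].
            continue
        found.append(E[pos])  # first occurrence of this document in range
        stack.append((l, pos - 1))
        stack.append((pos + 1, r))
    return sorted(found)


index = build_index(["banana", "ananas", "cherry"])
print(list_documents(index, "ana"))  # documents 0 and 1 contain "ana"
```

The key observation the sketch relies on is that a position i in [sp, ep] is the first occurrence of its document within the range exactly when C[i] < sp, so each recursive range-minimum step either reports a new document or prunes the whole subrange.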