Faster Compressed Top-k Document Retrieval

Authors:
Wing-Kai Hon;Sharma V. Thankachan;Rahul Shah;Jeffrey Scott Vitter
Venue:
DCC '13 Proceedings of the 2013 Data Compression Conference
Year:
2013

Abstract

Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given collection of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that, whenever a pattern $P$ (of length $p$) and an integer $k$ come as a query, the $k$ documents in which $P$ appears most frequently can be listed efficiently. In this paper, we propose a compressed index taking $2|CSA| + D\log\frac{n}{D} + O(D) + o(n)$ bits of space, which answers a query with $O(t_{sa}\log k \log^{\epsilon} n)$ per-document report time. This improves on the $O(t_{sa}\log k\log^{1+\epsilon} n)$ per-document report time of the previously best-known index with (asymptotically) the same space requirements [Belazzougui and Navarro, SPIRE 2011]. Here, $|CSA|$ represents the size (in bits) of the compressed suffix array (CSA) of the text obtained by concatenating all documents in $\mathcal{D}$, and $t_{sa}$ is the time for decoding a suffix array value using the CSA.
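To make the query semantics concrete, the following is a minimal brute-force sketch of the top-$k$ problem stated above. It is not the compressed index proposed in the paper: it scans every document on each query and uses no suffix array. The function names `count_occurrences` and `top_k_documents`, the choice to count overlapping occurrences, and the tie-breaking rule (lower document index first) are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps (assumed convention)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, most frequent first.

    Documents in which the pattern does not occur are never reported.
    Ties are broken by the lower document index (an illustrative choice).
    """
    scored = []
    for index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((index, freq))
    # Sort key: higher frequency first, then lower index.
    return heapq.nsmallest(k, scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    collection = ["abracadabra", "banana", "aaaa", "xyz"]
    print(top_k_documents(collection, "a", 2))
```

This baseline takes time proportional to the total collection length $n$ per query, regardless of $k$. An index of the kind described in the abstract instead aims for a cost that depends on the number of reported documents, which is why the bounds are stated as per-document report times.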