Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences
ACM Computing Surveys (CSUR)
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given collection of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that, whenever a pattern $P$ (of length $p$) and an integer $k$ arrive as a query, the $k$ documents in which $P$ occurs most often can be listed efficiently. In this paper, we propose a compressed index taking $2|CSA|+D\log\frac{n}{D}+O(D)+o(n)$ bits of space, which answers a query with $O(t_{sa}\log k \log^{\epsilon} n)$ per-document report time. This improves on the $O(t_{sa}\log k\log^{1+\epsilon} n)$ per-document report time of the previously best-known index with (asymptotically) the same space requirements [Belazzougui and Navarro, SPIRE 2011]. Here, $|CSA|$ denotes the size (in bits) of the compressed suffix array (CSA) of the text obtained by concatenating all documents in $\mathcal{D}$, and $t_{sa}$ is the time to decode a suffix array value using the CSA.
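To make the query semantics concrete, the following is a minimal brute-force sketch of top-$k$ document retrieval by pattern frequency. It is not the compressed index described above: it scans every document at query time and uses no suffix array. The function names `count_occurrences` and `top_k_documents`, the 0-based document identifiers, the counting of overlapping occurrences, and the tie-breaking by smaller document index are all assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0  # assumption: the empty pattern matches nothing
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so that overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (doc_index, frequency) pairs, most frequent first.

    Documents in which the pattern does not occur are never reported.
    Ties are broken by the smaller document index (an assumption).
    """
    hits = []
    for doc_index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((freq, doc_index))
    # Select the k best by (higher frequency, then lower index).
    best = heapq.nsmallest(k, hits, key=lambda hit: (-hit[0], hit[1]))
    return [(doc_index, freq) for freq, doc_index in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabana", "aaaa"]
    print(top_k_documents(docs, "a", 2))    # [(0, 5), (3, 4)]
    print(top_k_documents(docs, "ana", 3))  # [(1, 2), (2, 1)]
```

This baseline spends time proportional to the total length $n$ on every query, which is exactly the cost an index is meant to avoid; it is useful mainly as a reference against which an indexed implementation can be checked.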