Dynamic entropy-compressed sequences and full-text indexes

  • Authors:
  • Veli Mäkinen;Gonzalo Navarro

  • Affiliations:
  • University of Helsinki, Helsinki, Finland;University of Chile, Santiago, Chile

  • Venue:
  • ACM Transactions on Algorithms (TALG)
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

We give new solutions to the Searchable Partial Sums with Indels problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [2003c] achieving the same space and O(log n/log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k. We also give new solutions to the Dynamic Sequence problem. Given a sequence of n symbols in the range [1,σ] with binary zero-order entropy H0, we present a dynamic data structure that requires nH0 + o(n log σ) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n log σ) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences. In the case σ = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity. Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size σ, of overall length n and hth order empirical entropy Hh. The index requires nHh + o(n log σ) bits of space, for any h ≤ α logsigma n and constant 0 m in time O(m log n log σ). Each such occurrence can be reported in O(log2nlog log n) time, and displaying a context of length ℓ from a text takes time O(log n(ℓ log σ + log n log log n)). Insertion/deletion of a text to/from the collection takes O(log n log σ) time per symbol. This significantly improves the space of a previous result by Chan et al. [2004] in exchange for a slight time complexity penalty. We achieve at the same time the first dynamic index requiring essentially nHh bits of space, and the first construction of a compressed full-text self-index within that working space. Previous results achieve at best O(nHh space with constants larger than 1 [Ferragina and Manzini 2000; Arroyuelo and Navarro 2005] and higher time complexities. An important result we prove in this paper is that the wavelet tree of the Burrows-Wheeler transform of a text, if compressed with a technique that achieves zero-order compression locally (e.g., Raman et al. [2002]), automatically achieves hth order entropy space for any h. This unforeseen relation is essential for the results of the previous paragraph, but it also derives into significant simplifications on many existing static compressed full-text self-indexes that build on wavelet trees.