Opportunistic data structures with applications

  • Authors:
  • Paolo Ferragina;Giovanni Manzini

  • Affiliations:
  • -;-

  • Venue:
  • Opportunistic data structures with applications
  • Year:
  • 2000

Quantified Score

Hi-index 0.00

Visualization

Abstract

There is an upsurging interest in designing succinct data structures for basic searching problems (see [Munro99] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also intimately related to performance improvements as noted by several authors (e.g. Knuth [knuth3], Bentley [bentley]). In designing these implicit data structures the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in the final query performance. Yet input data are represented in their entirety thus taking no advantage of possible repetitiveness into them. The importance of those issues is well known to programmers who typically use various tricks to squeeze data as much as possible and still achieve good query performance. Their approaches, thought, boil down to heuristics whose effectiveness is witnessed only by experimentation. In this paper, we address the issue of compressing and indexing data by studying it in a theoretical framework. We devise a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The novelty resides in the careful combination of a compression algorithm, proposed by Burrows-Wheeler [bw], with the structural properties of a well known indexing tool, the Suffix Array [MM93]. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible at no significant slowdown in the query performance and without any assumption on a particular fixed distribution. More precisely, its space occupancy is optimal in a information-content sense because a text $T[1,u]$ is stored using $O(k(T)) + o(1)$ bits per input symbol, where $k(T)$ is the $k$th order entropy of $T$ (the bound holds for any fixed $k$). Given an arbitrary string $P[1,p]$, the opportunistic data structure allows to search for the $occ$ occurrences of $P$ in $T$ requiring $O(p + occ \log^\epsilon u)$ time complexity (for any fixed $\epsilon