Compressed data structures: Dictionaries and data-aware measures

  • Authors:
  • Ankur Gupta;Wing-Kai Hon;Rahul Shah;Jeffrey Scott Vitter

  • Affiliations:
  • Department of Computer Science, Butler University, Indianapolis, IN 46208, USA;Department of Computer Science, National Tsing Hua University, Taiwan;Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA;Department of Computer Sciences, Purdue University, West Lafayette, IN 47907-2066, USA

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2007

Abstract

In this paper, we propose measures for compressed data structures, in which space usage is measured in a data-aware manner. In particular, we consider the fundamental dictionary problem on set data, where the task is to construct a data structure for representing a set S of n items drawn from a universe U = {0, ..., u-1} and supporting various queries on S. We use a well-known data-aware measure for set data called gap to bound the space of our data structures. We describe a novel dictionary structure that requires gap + O(n log(u/n)/log n) + O(n log log(u/n)) bits. Under the RAM model, our dictionary supports membership, rank, and predecessor queries in nearly optimal time, matching the time bound of Andersson and Thorup's predecessor structure [A. Andersson, M. Thorup, Tight(er) worst-case bounds on dynamic searching and priority queues, in: ACM Symposium on Theory of Computing, STOC, 2000], while simultaneously improving upon their space usage. We support select queries even faster, in O(log log n) time. Our dictionary structure uses exactly gap bits in the leading term (i.e., the constant factor is 1) and answers queries in near-optimal time. When seen from the worst-case perspective, we present the first O(n log(u/n))-bit dictionary structure that supports these queries in near-optimal time under the RAM model. We also build a dictionary that requires the same space and supports membership, select, and partial rank queries even more quickly, in O(log log n) time. We go on to show that for many (real-world) datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. To the best of our knowledge, these are the first results that achieve data-aware space usage and retain near-optimal time.
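
The gap measure contrasted with the worst-case bound in the abstract can be illustrated with a minimal sketch. The snippet below assumes the common convention of charging roughly ceil(log2(g + 1)) bits to each gap g between consecutive set elements; the exact constants differ slightly across papers, so this is an illustrative approximation rather than the paper's precise accounting, and the helper names (gap_measure, worst_case_bits) are hypothetical.

```python
import math

def gap_measure(S):
    """Approximate data-aware 'gap' space of a set S of non-negative integers.

    Assumption: charge roughly ceil(log2(g + 1)) bits to each gap g between
    consecutive elements (the first element is measured against 0). Constant
    conventions vary between papers; this is an illustration only.
    """
    bits = 0
    prev = 0
    for x in sorted(S):
        g = x - prev
        bits += math.ceil(math.log2(g + 1))
        prev = x
    return bits

def worst_case_bits(n, u):
    """Data-oblivious (combinatorial) bound of roughly n * log2(u / n) bits."""
    return math.ceil(n * math.log2(u / n))

if __name__ == "__main__":
    u = 1 << 32
    # Three tight clusters of 1000 consecutive integers each.
    S = [base + i
         for base in (10_000, 5_000_000, 3_000_000_000)
         for i in range(1000)]
    print("gap bits:       ", gap_measure(S))
    print("worst-case bits:", worst_case_bits(len(S), u))
```

On clustered data like this, the gap total comes out far below the roughly n log(u/n) worst-case bound, which is the kind of compression the data-aware analysis in the paper is meant to capture.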