Compressed data structures: Dictionaries and data-aware measures

  • Authors:
  • Ankur Gupta;Wing-Kai Hon;Rahul Shah;Jeffrey Scott Vitter

  • Affiliations:
  • Department of Computer Science, Butler University, Indianapolis, IN 46208, USA;Department of Computer Science, National Tsing Hua University, Taiwan;Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA;Department of Computer Sciences, Purdue University, West Lafayette, IN 47907-2066, USA

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2007

Abstract

In this paper, we propose measures for compressed data structures, in which space usage is measured in a data-aware manner. In particular, we consider the fundamental dictionary problem on set data, where the task is to construct a data structure for representing a set S of n items drawn from a universe U = {0, ..., u-1} and supporting various queries on S. We use a well-known data-aware measure for set data called gap to bound the space of our data structures. We describe a novel dictionary structure that requires gap + O(n log(u/n)/log n) + O(n log log(u/n)) bits. Under the RAM model, our dictionary supports membership, rank, and predecessor queries in nearly optimal time, matching the time bound of Andersson and Thorup's predecessor structure [A. Andersson, M. Thorup, Tight(er) worst-case bounds on dynamic searching and priority queues, in: ACM Symposium on Theory of Computing, STOC, 2000], while simultaneously improving upon their space usage. We support select queries even faster, in O(log log n) time. Our dictionary structure uses exactly gap bits in the leading term (i.e., the constant factor is 1) and answers queries in near-optimal time. When seen from the worst-case perspective, we present the first O(n log(u/n))-bit dictionary structure that supports these queries in near-optimal time under the RAM model. We also build a dictionary that requires the same space and supports membership, select, and partial rank queries even more quickly, in O(log log n) time. We go on to show that for many (real-world) datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. To the best of our knowledge, these are the first results that achieve data-aware space usage and retain near-optimal time.
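
The gap measure contrasted with the worst-case bound in the abstract can be illustrated with a minimal sketch. The snippet below assumes the common convention of charging roughly ceil(log2(g + 1)) bits to each gap g between consecutive set elements; the exact constants differ slightly across papers, so this is an illustrative approximation rather than the paper's precise accounting, and the helper names (gap_measure, worst_case_bits) are hypothetical.

```python
import math

def gap_measure(S):
    """Approximate data-aware 'gap' space of a set S of non-negative integers.

    Assumption: charge roughly ceil(log2(g + 1)) bits to each gap g between
    consecutive elements (the first element is measured against 0). Constant
    conventions vary between papers; this is an illustration only.
    """
    bits = 0
    prev = 0
    for x in sorted(S):
        g = x - prev
        bits += math.ceil(math.log2(g + 1))
        prev = x
    return bits

def worst_case_bits(n, u):
    """Data-oblivious (combinatorial) bound of roughly n * log2(u / n) bits."""
    return math.ceil(n * math.log2(u / n))

if __name__ == "__main__":
    u = 1 << 32
    # Three tight clusters of 1000 consecutive integers each.
    S = [base + i
         for base in (10_000, 5_000_000, 3_000_000_000)
         for i in range(1000)]
    print("gap bits:       ", gap_measure(S))
    print("worst-case bits:", worst_case_bits(len(S), u))
```

On clustered data like this, the gap total comes out far below the roughly n log(u/n) worst-case bound, which is the kind of compression the data-aware analysis in the paper is meant to capture.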