How to wring a table dry: entropy compression of relations and querying of compressed relations

Authors:
Vijayshankar Raman;Garret Swart
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006



Abstract

We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable-length codes to exploit skew in their frequencies. The codes in each tuple are concatenated, and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice we obtain up to a 40-fold compression ratio on vertical partitions tuned for TPC-H queries.

We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data without accessing the full dictionary. We also exploit the delta coding to speed up scans by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations we can efficiently scan, tokenize, and apply predicates on compressed relations.
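
The pipeline described in the abstract (per-column variable-length codes, concatenation into tuplecodes, sorting, delta coding) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `huffman_code`, `compress`, and `decompress` are hypothetical, and several simplifications are assumptions made here for brevity: tuplecodes are zero-padded to a common width so they can be compared as integers, the deltas are kept as plain integers rather than being entropy-coded themselves, the input is assumed non-empty, and neither co-coding nor segregated coding is shown.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(values):
    """Build a Huffman code {value: bitstring} from the value frequencies."""
    freq = Counter(values)
    if len(freq) == 1:
        # A single distinct value still needs one bit in this sketch.
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(n, next(tie), {v: ""}) for v, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {v: "0" + b for v, b in c0.items()}
        merged.update({v: "1" + b for v, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


def compress(rows):
    """Code each column, concatenate per tuple, sort, and delta-code."""
    codes = [huffman_code(col) for col in zip(*rows)]
    tuplecodes = ["".join(c[v] for c, v in zip(codes, row)) for row in rows]
    width = max(len(t) for t in tuplecodes)
    # Assumption of this sketch: zero-pad to a fixed width for comparison.
    keys = sorted(int(t.ljust(width, "0"), 2) for t in tuplecodes)
    # The first entry is the smallest key; the rest are gaps between neighbours.
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    return codes, width, deltas


def decompress(codes, width, deltas):
    """Rebuild the rows (in tuplecode order, not the original order)."""
    decoders = [{bits: v for v, bits in c.items()} for c in codes]
    rows, key = [], 0
    for d in deltas:
        key += d
        bits = format(key, f"0{width}b")
        row, pos = [], 0
        for dec in decoders:
            end = pos + 1
            while bits[pos:end] not in dec:  # prefix-free, so first hit wins
                end += 1
            row.append(dec[bits[pos:end]])
            pos = end
        rows.append(tuple(row))
    return rows


rows = [("US", "gold"), ("US", "gold"), ("US", "silver"),
        ("DE", "gold"), ("US", "gold")]
codes, width, deltas = compress(rows)
restored = decompress(codes, width, deltas)
```

Because a relation is a multiset, `restored` holds the same tuples as `rows` but in sorted tuplecode order. Duplicate tuples produce zero deltas and near-duplicates produce small ones, which is the redundancy that sorting followed by delta coding is meant to expose.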