Compressing table data with column dependency

Authors:
Binh Dao Vo; Kiem-Phong Vo
Affiliations:
Columbia University, 2960 Broadway, New York, NY 10027, USA; AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA
Venue:
Theoretical Computer Science
Year:
2007

Abstract

Tables are two-dimensional arrays given in row-major order. Such data have distinctive features that can be exploited for effective compression. For example, tables often represent database files whose rows are records, so certain columns (fields) may contain only a few distinct values. Simply transposing the data, so that the values of each column are stored contiguously, can therefore make it compress better. Further, a large source of redundancy in a table is the correlation among columns that represent related types of data. This paper formalizes the notion of column dependency as a way to capture this redundancy across columns, and discusses how to compute it automatically and use it to substantially improve table compression.
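
The two ideas in the abstract, transposing a row-major table and exploiting dependency between columns, can be illustrated with a small sketch. This is not the paper's algorithm: the single-byte cells, the synthetic table, the helper names (`transpose`, `entropy`, `conditional_entropy`), and the use of empirical conditional entropy as a stand-in for "column dependency" are all assumptions made for illustration, and `zlib` merely stands in for whatever compressor is actually applied.

```python
import math
import random
import zlib
from collections import Counter, defaultdict


def transpose(data: bytes, n_rows: int, n_cols: int) -> bytes:
    """Convert a row-major table of single-byte cells to column-major order."""
    assert len(data) == n_rows * n_cols
    return bytes(data[r * n_cols + c] for c in range(n_cols) for r in range(n_rows))


def entropy(values) -> float:
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(k / total * math.log2(k / total) for k in counts.values())


def conditional_entropy(target, given) -> float:
    """Empirical H(target | given): bits per target symbol once `given` is known."""
    groups = defaultdict(list)
    for g, t in zip(given, target):
        groups[g].append(t)
    n = len(target)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


# Hypothetical 3-column table: column A has few distinct values,
# column B is fully determined by A, column C is noise.
rng = random.Random(0)
n_rows, n_cols = 1000, 3
rows = []
for _ in range(n_rows):
    a = rng.randrange(4)
    b = (a * 7 + 1) % 256
    c = rng.randrange(256)
    rows.append(bytes([a, b, c]))
table = b"".join(rows)          # row-major layout
transposed = transpose(table, n_rows, n_cols)  # column-major layout

col_a = table[0::n_cols]
col_b = table[1::n_cols]

# Compressed sizes of the two layouts (values depend on the data and compressor).
print(len(zlib.compress(table)), len(zlib.compress(transposed)))
# Bits per symbol needed for B alone versus B once A is known.
print(entropy(col_b), conditional_entropy(col_b, col_a))
```

In this toy table, knowing column A leaves no uncertainty about column B, so a compressor that encodes B conditioned on A would in principle spend no bits on B. Real tables show weaker, partial dependencies, which is why a formal measure of dependency is useful for deciding which columns to pair.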