Compressing table data with column dependency

  • Authors:
  • Binh Dao Vo; Kiem-Phong Vo

  • Affiliations:
  • Columbia University, 2960 Broadway, New York, NY 10027, USA; AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2007

Abstract

Tables are two-dimensional arrays given in row-major order. Such data have distinctive features that can be exploited for effective compression. For example, tables often represent database files with rows as records, so certain columns (fields) may have only a few distinct values. Simply transposing the data, so that the values of each column are stored contiguously, can therefore make it compress better. A further large source of redundancy in a table is the correlation among columns that represent related types of data. This paper formalizes the notion of column dependency as a way to capture this redundancy across columns, and discusses how to compute it automatically and use it to substantially improve table compression.
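
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm. It assumes a toy table of one-byte cells, uses zlib as a stand-in general-purpose compressor, and uses empirical conditional entropy as one informal way to see that a column is predictable from another. The function names (transpose, entropy, conditional_entropy) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import zlib
from collections import Counter


def transpose(data: bytes, n_cols: int) -> bytes:
    """Reorder a row-major table of one-byte cells into column-major order."""
    if n_cols <= 0 or len(data) % n_cols != 0:
        raise ValueError("data length must be a positive multiple of n_cols")
    # data[j::n_cols] takes every n_cols-th byte starting at j, i.e. column j.
    return b"".join(data[j::n_cols] for j in range(n_cols))


def entropy(values) -> float:
    """Empirical Shannon entropy, in bits per value."""
    values = list(values)
    n = len(values)
    counts = Counter(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(target, given) -> float:
    """Empirical H(target | given) = H(given, target) - H(given)."""
    target, given = list(target), list(given)
    return entropy(zip(given, target)) - entropy(given)


# Toy table (an assumption, not data from the paper): 1000 rows, 3 columns.
# Column 0 is a region code, column 1 is fully determined by column 0,
# and column 2 is a counter that wraps at 256.
regions, codes = b"NSEW", b"abcd"
row_major = bytes(
    cell
    for i in range(1000)
    for cell in (regions[(i * 7) % 4], codes[(i * 7) % 4], i % 256)
)
col_major = transpose(row_major, 3)

# Compare compressed sizes of the two layouts.
print("row-major bytes:   ", len(zlib.compress(row_major)))
print("column-major bytes:", len(zlib.compress(col_major)))

# Column 1 alone carries 2 bits per value, but none once column 0 is known.
col0, col1 = row_major[0::3], row_major[1::3]
print("H(col1)        =", entropy(col1))
print("H(col1 | col0) =", conditional_entropy(col1, col0))
```

Which layout compresses better depends on the data and on the compressor, so the printed sizes should be read as an experiment rather than a guarantee. The entropy comparison shows the kind of cross-column redundancy that a dependency-aware scheme could exploit: when one column is a function of another, it adds no information once the other column is known.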