Elements of information theory
The traveling salesman problem with distances one and two
Mathematics of Operations Research
Linear approximation of shortest superstrings
Journal of the ACM (JACM)
A new challenge for compression algorithms: genetic sequences
Information Processing and Management: an International Journal - Special issue: data compression
Data compression on a database system
Communications of the ACM
Algorithms on strings, trees, and sequences: computer science and computational biology
Proof verification and the hardness of approximation problems
Journal of the ACM (JACM)
Compression of Low Entropy Strings with Lempel-Ziv Algorithms
SIAM Journal on Computing
Engineering the compression of massive tables: an experimental approach
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Computers and Intractability: A Guide to the Theory of NP-Completeness
DCC '99 Proceedings of the Conference on Data Compression
Compressing table data with column dependency
Theoretical Computer Science
Compressing large boolean matrices using reordering techniques
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
RadixZip: linear time compression of token streams
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
AlphaSum: size-constrained table summarization using value lattices
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Reducing metadata complexity for faster table summarization
Proceedings of the 13th International Conference on Extending Database Technology
Data structures: time, I/Os, entropy, joules!
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
PRESIDIO: A Framework for Efficient Archival Data Storage
ACM Transactions on Storage (TOS)
We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [2000], in which an off-line training procedure partitions a table into disjoint intervals of columns, each of which is then compressed separately by a standard on-line compressor such as gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, both of which are used to improve compression rates. Based on this theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources. We also devise a new off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns before partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide a 35-55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.
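The partitioning step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a table of fixed-width rows held in memory as `bytes` objects, serializes each column interval row by row, uses Python's `zlib` as a stand-in for gzip, and picks the contiguous column intervals by a straightforward dynamic program. The function names `compressed_size` and `best_partition` are hypothetical, and column reordering is not attempted.

```python
import zlib


def compressed_size(rows, lo, hi):
    """Size in bytes of columns lo..hi-1 after zlib compression.

    Assumption: the interval is serialized row by row before compression.
    """
    data = b"".join(row[lo:hi] for row in rows)
    return len(zlib.compress(data, 9))


def best_partition(rows):
    """Choose contiguous column intervals minimizing total compressed size.

    best[hi] holds the cheapest cost of compressing columns 0..hi-1;
    cut[hi] records where the last interval of that solution starts.
    """
    n = len(rows[0])
    best = [0] + [None] * n
    cut = [0] * (n + 1)
    for hi in range(1, n + 1):
        for lo in range(hi):
            cost = best[lo] + compressed_size(rows, lo, hi)
            if best[hi] is None or cost < best[hi]:
                best[hi], cut[hi] = cost, lo
    # Walk the recorded cuts backwards to recover the intervals.
    intervals = []
    hi = n
    while hi > 0:
        intervals.append((cut[hi], hi))
        hi = cut[hi]
    return best[n], intervals[::-1]


# Toy table (made up for illustration): two correlated columns, two unrelated ones.
rows = [bytes([i % 7, i % 7, (i * 37) % 251, (i * 91) % 241]) for i in range(200)]
total, parts = best_partition(rows)
print("intervals:", parts, "total bytes:", total)
print("single interval:", compressed_size(rows, 0, len(rows[0])))
```

Because the whole table as one interval is among the candidates the dynamic program considers, the chosen partition never costs more than compressing all columns together under this cost function. The sketch calls the compressor once per candidate interval, which is quadratic in the number of columns; it is meant only to make the idea of training a partition concrete.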