Elements of information theory
The traveling salesman problem with distances one and two
Mathematics of Operations Research
Linear approximation of shortest superstrings
Journal of the ACM (JACM)
A new challenge for compression algorithms: genetic sequences
Information Processing and Management: an International Journal - Special issue: data compression
Data compression on a database system
Communications of the ACM
Algorithms on strings, trees, and sequences: computer science and computational biology
Compression of Low Entropy Strings with Lempel--Ziv Algorithms
SIAM Journal on Computing
Engineering the compression of massive tables: an experimental approach
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Computers and Intractability: A Guide to the Theory of NP-Completeness
DCC '99 Proceedings of the Conference on Data Compression
Migrating an MVS mainframe application to a PC
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Optimal partitions of strings: a new class of Burrows-Wheeler compression algorithms
CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
A compression-boosting transform for two-dimensional data
AAIM'06 Proceedings of the Second international conference on Algorithmic Aspects in Information and Management
We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which an off-line training procedure partitions a table into disjoint intervals of columns, each of which is compressed separately by a standard on-line compressor such as gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on this theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files and not just to continuously operating sources. We also devise a new off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide 35-55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.
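To make the partitioning idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a table given as equal-length byte rows, uses zlib as a stand-in for gzip, and serializes each column interval in row-major order (an assumption). The names `compressed_size` and `best_partition` are hypothetical. A simple dynamic program then picks the split into contiguous column intervals that minimizes the total compressed size.

```python
import zlib

def compressed_size(rows, start, end):
    """Size in bytes of columns [start, end) compressed as one unit.

    Assumption: the interval is serialized row by row before compression.
    """
    data = b"".join(row[start:end] for row in rows)
    return len(zlib.compress(data, 9))

def best_partition(rows):
    """Return (total_size, intervals) for the cheapest contiguous partition.

    best[j] holds the minimal total compressed size of columns [0, j);
    cut[j] records where the last interval of that solution begins.
    """
    n = len(rows[0])
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + compressed_size(rows, i, j)
            if cost < best[j]:
                best[j] = cost
                cut[j] = i
    # Walk the cut points backwards to recover the intervals.
    intervals = []
    j = n
    while j > 0:
        intervals.append((cut[j], j))
        j = cut[j]
    intervals.reverse()
    return best[n], intervals

if __name__ == "__main__":
    # Hypothetical toy table: 200 rows of 4 one-byte columns.
    table = [bytes([i % 7, 65, (i * 31) % 251, 66]) for i in range(200)]
    total, parts = best_partition(table)
    print(total, parts)
```

This sketch calls the compressor on every candidate interval, so it performs a quadratic number of compressions in the number of columns. It is meant only to illustrate why training is done once, off-line, on a sample; it does not model the column reordering or the on-line training described in the abstract.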