Reordering rows for better compression: Beyond the lexicographic order

Authors:
Daniel Lemire; Owen Kaser; Eduardo Gutarra
Affiliations:
TELUQ; University of New Brunswick, Saint John; University of New Brunswick, Saint John
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2012

Abstract

Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographic order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, is a variant of Nearest Neighbor that trades some compression for a major running-time speedup, which makes it a good option for very large tables. However, for some compression schemes, it is more important to generate long runs than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding by up to a factor of 3 and prefix coding by up to 80%; these gains are on top of the gains from lexicographically sorting the table. In a few cases, we prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns.
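
As a rough illustration of why row order matters for run-length encoding, the sketch below counts column-wise runs for a tiny table under three orderings. It relies on one identity: the total number of runs equals the number of columns plus the sum of Hamming distances between consecutive rows, which is what links run minimization to the traveling salesman problem. The function names (`count_runs`, `hamming`, `nearest_neighbor_order`) and the sample table are assumptions made for this example. The greedy ordering is the plain textbook Nearest Neighbor heuristic, not the paper's Multiple Lists or Vortex heuristics.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def count_runs(rows: Sequence[Row]) -> int:
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    runs = len(rows[0])  # every column starts with one run
    for prev, cur in zip(rows, rows[1:]):
        # each value that differs from the row above starts a new run
        runs += sum(1 for a, b in zip(prev, cur) if a != b)
    return runs


def hamming(a: Row, b: Row) -> int:
    """Number of columns in which two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_neighbor_order(rows: Sequence[Row]) -> List[Row]:
    """Greedy O(n^2) ordering: repeatedly append the closest unused row.

    Illustrative only; this is the generic Nearest Neighbor heuristic.
    """
    remaining = list(rows)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        last = order[-1]
        idx = min(range(len(remaining)),
                  key=lambda i: hamming(last, remaining[i]))
        order.append(remaining.pop(idx))
    return order


# Hypothetical two-column table, used only for this example.
table = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]
lexicographic = sorted(table)
greedy = nearest_neighbor_order(lexicographic)

print(count_runs(table))          # 7 runs in the original order
print(count_runs(lexicographic))  # 6 runs after lexicographic sorting
print(count_runs(greedy))         # 5 runs after the greedy reordering
```

On this toy input the greedy order visits the rows so that each consecutive pair differs in exactly one column, which is why it beats the lexicographic order by one run. The quadratic cost of the greedy search is the kind of running-time penalty the abstract refers to when it mentions the trade-off for very large tables.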