Compressing large boolean matrices using reordering techniques

Authors:
David Johnson;Shankar Krishnan;Jatin Chhugani;Subodh Kumar;Suresh Venkatasubramanian
Affiliations:
AT&T Labs - Research;AT&T Labs - Research;Johns Hopkins University;Johns Hopkins University;AT&T Labs - Research
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004

Abstract

Large boolean matrices are a basic representational unit in a variety of applications; notable examples include interactive visualization systems, mining of large graph structures, and association rule mining. Designing space- and time-efficient, scalable storage and query mechanisms for such large matrices is a challenging problem. We present a lossless compression strategy to store and access such large matrices efficiently on disk. Our approach views the columns of the matrix as points in a very high-dimensional Hamming space and formulates an appropriate optimization problem that reduces to solving an instance of the Traveling Salesman Problem (TSP) on this space. Finding good solutions to large TSPs in high-dimensional Hamming spaces is itself a challenging and little-explored problem: we cannot readily exploit geometry to avoid examining all N^2 inter-city distances, and instances can be too large for standard TSP codes to run in main memory. Our multi-faceted approach adapts classical TSP heuristics by means of instance partitioning and sampling, and may be of independent interest. For instances derived from interactive visualization and telephone call data, we obtain significant improvement in access time over standard techniques, and for the visualization application we also make significant improvements in compression.
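
To make the column-reordering idea concrete, here is a minimal Python sketch. It is not the authors' method. It assumes the quantity to minimize is the number of 0/1 transitions along the rows, which always equals the sum of Hamming distances between adjacent columns, and it uses a plain nearest-neighbor TSP heuristic that examines all pairwise distances. The partitioning and sampling the abstract mentions are not shown. All function names and the toy matrix are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def columns(matrix: List[List[int]]) -> List[List[int]]:
    """Return the columns of a row-major matrix as lists."""
    return [list(col) for col in zip(*matrix)]


def tour_cost(cols: List[List[int]], order: List[int]) -> int:
    """Sum of Hamming distances between consecutive columns (an open path)."""
    return sum(
        hamming(cols[order[i]], cols[order[i + 1]])
        for i in range(len(order) - 1)
    )


def greedy_order(cols: List[List[int]], start: int = 0) -> List[int]:
    """Nearest-neighbor heuristic: repeatedly append the closest unused column.

    Ties are broken by the smaller column index. This is quadratic in the
    number of columns, so it only illustrates the idea on small inputs.
    """
    remaining = set(range(len(cols))) - {start}
    order = [start]
    while remaining:
        last = cols[order[-1]]
        nxt = min(remaining, key=lambda j: (hamming(last, cols[j]), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def row_transitions(matrix: List[List[int]], order: List[int]) -> int:
    """Count 0/1 changes along every row after permuting the columns."""
    return sum(
        row[order[i]] != row[order[i + 1]]
        for row in matrix
        for i in range(len(order) - 1)
    )


# Toy 4 x 6 boolean matrix (made up for illustration).
matrix = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
]

if __name__ == "__main__":
    cols = columns(matrix)
    identity = list(range(len(cols)))
    order = greedy_order(cols)
    print("original transitions: ", row_transitions(matrix, identity))
    print("reordered columns:    ", order)
    print("reordered transitions:", row_transitions(matrix, order))
```

Fewer transitions per row means fewer and longer runs, which is what a run-length style encoder benefits from. Whether the paper's cost function has exactly this form is an assumption of the sketch.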