SPARTAN: a model-based semantic compression system for massive data tables

Authors:
Shivnath Babu;Minos Garofalakis;Rajeev Rastogi
Affiliations:
Stanford University and Bell Laboratories;Bell Laboratories;Bell Laboratories
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001


Abstract

While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach — SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference.
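The core idea in the abstract, dropping a column and keeping a small model that predicts it within a prescribed error tolerance, can be illustrated with a toy sketch. This is not SPARTAN's algorithm: it fits a single-split regression tree (a "stump") on one predictor column, and it assumes that rows the model cannot predict within the tolerance are stored explicitly as outliers. All function names (`fit_stump`, `compress`, `decompress`) and the sample data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree predicting y from x (least squares)."""
    best = None
    # Candidate thresholds: every distinct x except the smallest,
    # so both sides of the split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # All x values are equal: fall back to predicting the overall mean.
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]  # (threshold, left prediction, right prediction)


def predict(model, x):
    t, left_value, right_value = model
    return left_value if x < t else right_value


def compress(rows, tolerance):
    """Keep column x; replace column y by a model plus explicit outliers."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    model = fit_stump(xs, ys)
    # Assumption of this sketch: rows outside the tolerance are stored as-is.
    outliers = {
        i: y for i, (x, y) in enumerate(rows)
        if abs(predict(model, x) - y) > tolerance
    }
    return xs, model, outliers


def decompress(xs, model, outliers):
    """Rebuild (x, y) rows; y is exact for outliers, approximate otherwise."""
    return [(x, outliers.get(i, predict(model, x))) for i, x in enumerate(xs)]


# Hypothetical table with two columns; y is strongly correlated with x.
rows = [(1, 10.0), (2, 10.4), (3, 9.8), (4, 10.1),
        (10, 50.2), (11, 49.7), (12, 50.3), (13, 53.0)]
tolerance = 1.5

xs, model, outliers = compress(rows, tolerance)
restored = decompress(xs, model, outliers)
```

In this sketch the compressed form stores the `x` column, three numbers for the model, and only those `y` values that miss the tolerance. Every restored `y` value is therefore within `tolerance` of the original, which mirrors the per-attribute error bound described in the abstract.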