While a variety of lossy compression schemes have been developed for specific forms of digital data (e.g., images, audio, video), lossy compression of arbitrary data tables remains relatively unexplored. Such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets.

In this paper, we propose SPARTAN, a system that exploits attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed per-attribute error-tolerance constraints to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a subset of attributes (referred to as predicted attributes) whose values are not explicitly stored in the compressed table; instead, concise error-constrained CaRTs that predict these values (within the prescribed error tolerances) are maintained. To restrict the huge search space of possible CaRT predictors, SPARTAN uses a Bayesian network structure to guide the selection of CaRT models that minimize the overall storage requirement, based on the prediction and materialization costs for each attribute. SPARTAN's CaRT-building algorithms employ novel integrated pruning strategies that exploit the given per-attribute error constraints to minimize the computational effort involved.

Our experiments with several real-life data sets offer convincing evidence of the effectiveness of SPARTAN's model-based approach: SPARTAN consistently yields substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip), while using only small samples of the data for model inference.
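To make the core idea concrete, the sketch below illustrates the model-versus-materialize decision for a single candidate predicted attribute: fit a small regression tree from the remaining attributes, check that every reconstructed value stays within the prescribed error tolerance, and keep the model only if it appears cheaper to store than the raw column. This is a minimal illustration under simplifying assumptions (numeric attributes only, scikit-learn's DecisionTreeRegressor standing in for SPARTAN's CaRT builder, and a crude node-count cost model); it is not the paper's actual algorithm, which also uses a Bayesian network to pick predictor sets and integrated pruning during tree construction.

```python
# Illustrative sketch only: the function name, cost model, and tolerance check
# are assumptions for exposition, not SPARTAN's published procedure.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def try_cart_compression(table: pd.DataFrame, target: str, tolerance: float,
                         max_leaf_nodes: int = 16):
    """Return a fitted tree if `target` can be predicted from the other
    (numeric) columns within `tolerance` and the tree looks cheaper to store
    than the column itself; otherwise return None (materialize the column)."""
    X = table.drop(columns=[target]).to_numpy()
    y = table[target].to_numpy()

    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X, y)

    # Per-attribute error-tolerance constraint: every reconstructed value
    # must lie within `tolerance` of the original.
    if np.max(np.abs(tree.predict(X) - y)) > tolerance:
        return None

    # Crude storage comparison (assumed): ~2 numbers per tree node versus
    # one stored value per row for the materialized column.
    model_cost = 2 * tree.tree_.node_count
    column_cost = len(y)
    return tree if model_cost < column_cost else None


if __name__ == "__main__":
    # Toy table where one column is strongly correlated with another,
    # so a tiny tree can replace it within a loose tolerance.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"duration": rng.uniform(0, 100, 1000)})
    df["charge"] = 0.05 * df["duration"] + rng.uniform(-0.5, 0.5, 1000)
    model = try_cart_compression(df, target="charge", tolerance=1.0)
    print("store CaRT model" if model is not None else "materialize column")
```

In the full system this per-attribute decision is not made in isolation: the choice of which attributes to predict, and from which predictors, is driven by the Bayesian network over the attributes and by the combined prediction and materialization costs described in the abstract above.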