While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), lossy compression of arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a subset of attributes for which no values are explicitly stored in the compressed table; instead, it maintains concise CaRTs that predict these values within the prescribed error bounds. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experiments with several real-life data sets offer convincing evidence of the effectiveness of SPARTAN's model-based approach: SPARTAN consistently yields substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while using only small data samples for model inference.
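The core idea of replacing a stored column with a tree that predicts it within a tolerance can be illustrated with a minimal sketch. This is not SPARTAN's actual algorithm: it handles only one numeric target column, uses a naive greedy tree whose leaves hold the midrange of their target values, and skips the attribute selection and optimization steps the abstract mentions. The names `build_tree`, `compress`, `decompress`, the `minutes`/`charge` table, and the choice to store exact values for rows the tree misses are all assumptions made for illustration.

```python
def target_range(rows, target):
    vals = [r[target] for r in rows]
    return max(vals) - min(vals)

def build_tree(rows, target, predictors, tol, max_depth=3):
    """Greedy regression tree (illustrative only); leaves store the midrange."""
    vals = [r[target] for r in rows]
    lo, hi = min(vals), max(vals)
    # A midrange leaf is within tol of every value when the range is <= 2*tol.
    if hi - lo <= 2 * tol or max_depth == 0:
        return {"leaf": (lo + hi) / 2}
    best = None
    for attr in predictors:
        # Skipping the smallest distinct value keeps both sides non-empty.
        for cut in sorted({r[attr] for r in rows})[1:]:
            left = [r for r in rows if r[attr] < cut]
            right = [r for r in rows if r[attr] >= cut]
            cost = max(target_range(left, target), target_range(right, target))
            if best is None or cost < best[0]:
                best = (cost, attr, cut, left, right)
    if best is None:  # no predictor can separate these rows
        return {"leaf": (lo + hi) / 2}
    _, attr, cut, left, right = best
    return {"attr": attr, "cut": cut,
            "lo": build_tree(left, target, predictors, tol, max_depth - 1),
            "hi": build_tree(right, target, predictors, tol, max_depth - 1)}

def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["hi"] if row[tree["attr"]] >= tree["cut"] else tree["lo"]
    return tree["leaf"]

def compress(rows, target, predictors, tol, max_depth=3):
    """Drop the target column; keep the tree plus exact values it misses."""
    tree = build_tree(rows, target, predictors, tol, max_depth)
    outliers = {i: r[target] for i, r in enumerate(rows)
                if abs(predict(tree, r) - r[target]) > tol}
    stored = [{k: v for k, v in r.items() if k != target} for r in rows]
    return stored, tree, outliers

def decompress(stored, tree, outliers, target):
    """Rebuild the target column from predictions and stored exceptions."""
    return [{**r, target: outliers.get(i, predict(tree, r))}
            for i, r in enumerate(stored)]

# Hypothetical table: charge roughly tracks minutes, except for the last row.
table = [
    {"minutes": 1, "charge": 10}, {"minutes": 2, "charge": 11},
    {"minutes": 3, "charge": 12}, {"minutes": 10, "charge": 50},
    {"minutes": 11, "charge": 52}, {"minutes": 12, "charge": 51},
    {"minutes": 12, "charge": 500},
]

if __name__ == "__main__":
    stored, tree, outliers = compress(table, "charge", ["minutes"], tol=2, max_depth=2)
    restored = decompress(stored, tree, outliers, "charge")
    worst = max(abs(a["charge"] - b["charge"]) for a, b in zip(table, restored))
    print("rows stored exactly:", len(outliers), "worst error:", worst)
```

Because every row whose prediction misses the tolerance is kept verbatim, the reconstructed column never deviates from the original by more than `tol`, whatever the quality of the tree. A better tree only reduces how many exceptions must be stored, which is where any real compression gain would come from.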