How to juggle columns: an entropy-based approach for table compression

Authors:
Marcus Paradies;Christian Lemke;Hasso Plattner;Wolfgang Lehner;Kai-Uwe Sattler;Alexander Zeier;Jens Krueger
Affiliations:
SAP AG, Walldorf, Germany;SAP AG, Walldorf, Germany;Hasso-Plattner-Institute, Potsdam, Germany;SAP AG, Walldorf, Germany;Ilmenau University of Technology, Ilmenau, Germany;Hasso-Plattner-Institute, Potsdam, Germany;Hasso-Plattner-Institute, Potsdam, Germany
Venue:
Proceedings of the Fourteenth International Database Engineering & Applications Symposium
Year:
2010

Abstract

Many relational databases exhibit complex dependencies between data attributes, caused either by the nature of the underlying data or by explicitly denormalized schemas. In data warehouse scenarios, calculated key figures may be materialized or hierarchy levels may be held within a single dimension table. Such column correlations and the resulting data redundancy can increase storage requirements. They can also degrade query performance if inappropriate independence assumptions are made during query compilation. In this paper, we tackle the specific problem of detecting functional dependencies between columns to improve the compression rate for column-based database systems, which both reduces main memory consumption and improves query performance. Although a huge variety of algorithms have been proposed for detecting column dependencies in databases, we maintain that increased data volumes and recent developments in hardware architectures demand novel algorithms with much lower runtime overhead and a smaller memory footprint. Our approach is based on entropy estimations and combines sampling with multiple heuristics to make it applicable to a wide range of use cases. We demonstrate the quality of our approach by means of an implementation within the SAP NetWeaver Business Warehouse Accelerator. Our experiments indicate that our approach scales well with the number of columns and produces reliable dependence structure information, which in turn reduces memory consumption and improves performance for nontrivial queries.
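
To illustrate the general idea behind entropy-based dependency detection, the following minimal Python sketch uses a standard information-theoretic fact: a column A functionally determines a column B exactly when the conditional entropy H(B | A) = H(A, B) - H(A) is zero. This is not the algorithm from the paper, whose heuristics and sampling strategy are not described in the abstract. The function names, the uniform row sampling, and the `tolerance` parameter are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(determinant, dependent):
    """H(dependent | determinant) = H(determinant, dependent) - H(determinant)."""
    joint = list(zip(determinant, dependent))
    return entropy(joint) - entropy(determinant)


def looks_functionally_dependent(determinant, dependent,
                                 sample_size=None, tolerance=1e-9, seed=0):
    """Check whether `determinant -> dependent` holds on a (sampled) set of rows.

    Hypothetical helper: the uniform row sample and the tolerance are
    illustrative choices, not taken from the paper.
    """
    rows = list(zip(determinant, dependent))
    if sample_size is not None and sample_size < len(rows):
        rows = random.Random(seed).sample(rows, sample_size)
    a = [row[0] for row in rows]
    b = [row[1] for row in rows]
    return conditional_entropy(a, b) <= tolerance


# Toy dimension table: every city belongs to exactly one country,
# but a country can contain several cities.
city = ["Berlin", "Paris", "Berlin", "Lyon", "Paris"]
country = ["DE", "FR", "DE", "FR", "FR"]

city_determines_country = looks_functionally_dependent(city, country)
country_determines_city = looks_functionally_dependent(country, city)
```

In this toy table, `city` determines `country` but not the other way round, so a column store could in principle derive `country` from `city` instead of storing it independently. Note that a check on a sample can only refute a dependency with certainty. A dependency that holds on the sample may still be violated by rows outside it, so any sampled result would need validation against the full column.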