Linked Bernoulli Synopses: Sampling along Foreign Keys

Authors:
Rainer Gemulla;Philipp Rösch;Wolfgang Lehner
Affiliations:
Database Technology Group, Technische Universität Dresden, Germany;Database Technology Group, Technische Universität Dresden, Germany;Database Technology Group, Technische Universität Dresden, Germany
Venue:
SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Year:
2008

Citing 9
Cited 3

On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Congressional samples for approximate answering of group-by queries

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Selectivity estimation using probabilistic models

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Monotonic Optimization: Problems and Solution Approaches

SIAM Journal on Optimization
Overcoming Limitations of Sampling for Aggregation Queries

Proceedings of the 17th International Conference on Data Engineering
Dynamic sample selection for approximate query processing

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Graph-based synopses for relational selectivity estimation

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Designing Random Sample Synopses with Outliers

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering

A sample advisor for approximate query processing

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
Generation of test databases using sampling methods

Proceedings of the 2013 International Symposium on Software Testing and Analysis
Optimizing Sample Design for Approximate Query Processing

International Journal of Knowledge-Based Organizations

Quantified Score

Hi-index	0.00

Visualization

Abstract

Random sampling is a popular technique for providing fast approximate query answers, especially in data warehouse environments. Compared to other types of synopses, random sampling bears the advantage of retaining the dataset's dimensionality; it also associates probabilistic error bounds with the query results. Most of the available sampling techniques focus on table-level sampling, that is, they produce a sample of only a single database table. Queries that contain joins over multiple tables cannot be answered with such samples because join results on random samples are often small and skewed. On the contrary, schema-level sampling techniques by design support queries containing joins. In this paper, we introduce Linked Bernoulli Synopses, a schema-level sampling scheme based upon the well-known Join Synopses. Both schemes rely on the idea of maintaining foreign-key integrity in the synopses; they are therefore suited to process queries containing arbitrary foreign-key joins. In contrast to Join Synopses, however, Linked Bernoulli Synopses correlate the sampling processes of the different tables in the database so as to minimize the space overhead, without destroying the uniformity of the individual samples. We also discuss how to compute Linked Bernoulli Synopses which maximize the effective sampling fraction for a given memory budget. The computation of the optimum solution is often computationally prohibitive so that approximate solutions are needed. We propose a simple heuristic approach which is fast and seems to produce close-to-optimum results in practice. We conclude the paper with an evaluation of our methods on both synthetic and real-world datasets.