Efficient update data generation for DBMS benchmarks

Authors:
Michael Frank;Meikel Poess;Tilmann Rabl
Affiliations:
University of Passau, Passau, Germany;Oracle Corporation, Redwood City, CA, USA;University of Toronto, Toronto, ON, Canada
Venue:
ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Year:
2012

Abstract

Industry-standard benchmarks have proven crucial to the innovation and productivity of the computing industry. They enable fair, standardized assessment of performance across different vendors, across different system versions from the same vendor, and across different architectures. Good benchmarks are even meant to drive industry and technology forward. Yet even good benchmarks become obsolete over time, once all reasonable advances have been made using them. This is why standards consortia periodically overhaul their existing benchmarks or develop new ones. An extremely time- and resource-consuming task in the creation of new benchmarks is the development of benchmark generators, especially because benchmarks tend to become more and more complex. The first version of the Parallel Data Generation Framework (PDGF), a generic data generator, was capable of generating data for the initial load of arbitrary relational schemas. It was, however, not able to generate data for the actual workload, i.e., input data for transactions (insert, delete, and update), incremental load, etc., mainly because it did not understand the notion of updates. Updates are data changes that occur over time, e.g., a customer changes address, switches jobs, gets married, or has children. Many benchmarks need to reflect these changes during their workloads. In this paper we present PDGF Version 2, which contains extensions enabling the generation of update data.