Myriad: parallel data generation on shared-nothing architectures

  • Authors:
  • Alexander Alexandrov;Berni Schiefer;John Poelman;Stephan Ewen;Thomas O. Bodner;Volker Markl

  • Affiliations:
  • TU Berlin, Germany;IBM Toronto Lab, Canada;IBM Silicon Valley Lab;TU Berlin, Germany;TU Berlin, Germany;TU Berlin, Germany

  • Venue:
  • Proceedings of the 1st Workshop on Architectures and Systems for Big Data
  • Year:
  • 2011

Abstract

The need for efficient data generation for the purposes of testing and benchmarking newly developed massively-parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously, a task that often becomes more tedious as the set of model constraints grows. In this paper we present Myriad, a new parallel data generation toolkit. Data generators created with the toolkit can quickly produce very large datasets in a shared-nothing parallel execution environment, while at the same time preserving cross-partition dependencies, correlations, and distributions in the generated data. In addition, we report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad for the generation of OLAP-style relational datasets.
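To make the idea of shared-nothing generation with cross-partition dependencies concrete, the following Python sketch shows one common way to get that property: every value is a pure function of a global seed, a column name, and a row position, so any node can recompute a value owned by another partition without communicating. This is an illustration under stated assumptions, not Myriad's actual API or algorithm; the names `value_at`, `partition_range`, and `generate_orders`, the hash-based randomness, and the toy orders/customers schema are all hypothetical.

```python
import hashlib


def value_at(seed: int, column: str, row: int, modulus: int) -> int:
    """Deterministic pseudo-random value for (column, row).

    Hypothetical helper: the value depends only on its arguments, so every
    node computes the same result without shared state.
    """
    digest = hashlib.sha256(f"{seed}:{column}:{row}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def partition_range(total_rows: int, num_nodes: int, node_id: int) -> tuple[int, int]:
    """Half-open row range [start, end) assigned to node_id."""
    base, extra = divmod(total_rows, num_nodes)
    start = node_id * base + min(node_id, extra)
    end = start + base + (1 if node_id < extra else 0)
    return start, end


def generate_orders(seed, total_orders, total_customers, num_nodes, node_id):
    """Yield the (order_id, customer_id, customer_segment) rows owned by one node."""
    start, end = partition_range(total_orders, num_nodes, node_id)
    for order_id in range(start, end):
        # Foreign key into the (assumed) customers table.
        customer_id = value_at(seed, "order.customer_id", order_id, total_customers)
        # Cross-partition dependency: recompute the referenced customer's
        # attribute locally instead of fetching it from the owning node.
        customer_segment = value_at(seed, "customer.segment", customer_id, 5)
        yield (order_id, customer_id, customer_segment)
```

Under these assumptions, concatenating the output of all nodes in node order gives the same rows as a single-node run, which is what allows the partitions to be generated independently.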