Quickly generating billion-record synthetic databases
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
A parallel general-purpose synthetic data generator
ACM SIGMOD Record
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
E = MC3: managing uncertain enterprise data in a cluster-computing environment
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing
A data generator for cloud-scale benchmarking
TPCTC'10 Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems
ASTERIX: towards a scalable, semistructured data platform for evolving-world models
Distributed and Parallel Databases
Myriad: scalable and expressive data generation
Proceedings of the VLDB Endowment
Issues in big data testing and benchmarking
Proceedings of the Sixth International Workshop on Testing Database Systems
Hi-index | 0.00 |
The need for efficient data generation for the purposes of testing and benchmarking newly developed massively-parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously -- a task that often becomes more tedious as the set of model constraints grows. In this paper we present Myriad - a new parallel data generation toolkit. Data generators created with the toolkit can quickly produce very large datasets in a shared-nothing parallel execution environment, while at the same time preserve with cross-partition dependencies, correlations and distributions in the generated data. In addition, we report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad for the generation of OLAP-style relational datasets.