Quickly generating billion-record synthetic databases

  • Authors and affiliations:
  • Jim Gray, Digital Equipment Corporation, 455 Market, San Francisco, CA
  • Prakash Sundaresan, Digital Equipment Corporation, 455 Market, San Francisco, CA
  • Susanne Englert, Tandem Computers Inc., 19333 Vallco Parkway, Cupertino, CA
  • Ken Baclawski, Computer Science, Northeastern University, 360 Huntington Ave., Boston, MA
  • Peter J. Weinberger, Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ

  • Venue:
  • SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
  • Year:
  • 1994


Abstract

Evaluating database system performance often requires generating synthetic databases—ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrent to the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors, with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.