Generation of test databases using sampling methods

Authors:
Teodora Sandra Buda
Affiliations:
University College Dublin, Ireland
Venue:
Proceedings of the 2013 International Symposium on Software Testing and Analysis
Year:
2013

Abstract

Populating the testing environment with relevant data is a major challenge in software validation. It generally requires expert knowledge of the system under development, because the data critically affects the outcome of the tests designed to assess the system. Current practice either generates synthetic data with efficient algorithms or uses the production environment itself for testing. The latter is an invaluable strategy for providing real test cases that expose issues critically affecting users of the system. However, the production environment generally holds large amounts of data that are difficult to handle and analyze. Sampling a database from the production environment is a potential solution to these challenges. In this research, we propose two database sampling methods, VFDS and CoDS, for populating the testing environment. The first is a very fast random sampling approach, while the second aims to preserve the distribution of the data in order to produce a representative sample. In particular, CoDS focuses on the dependencies between data in different tables and tries to preserve the distributions of these dependencies.
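
The abstract does not describe how VFDS or CoDS work internally, so the following is only a minimal illustrative sketch of the general idea of sampling along a foreign key. It is not the authors' algorithm. It assumes a hypothetical two-table schema (`customers` as the parent, `orders` as the child referencing it through `customer_id`), samples parent rows uniformly at random, keeps only the child rows that reference a sampled parent, and then compares one simple dependency distribution (number of child rows per parent) between the original and the sample. All table names, column names, and helper functions here are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_with_foreign_keys(parents, children, fk, rate, seed=0):
    """Uniformly sample parent rows, then keep only the child rows whose
    foreign key points at a sampled parent, so the sample stays
    referentially consistent."""
    rng = random.Random(seed)
    k = max(1, round(len(parents) * rate))
    sampled_parents = rng.sample(parents, k)
    kept_ids = {p["id"] for p in sampled_parents}
    sampled_children = [c for c in children if c[fk] in kept_ids]
    return sampled_parents, sampled_children


def fanout_distribution(parents, children, fk):
    """Relative frequency of 'number of child rows per parent row'."""
    per_parent = Counter(c[fk] for c in children)
    fanouts = Counter(per_parent.get(p["id"], 0) for p in parents)
    total = len(parents)
    return {n: count / total for n, count in sorted(fanouts.items())}


# Hypothetical toy data: customer i has (i % 4) orders.
customers = [{"id": i} for i in range(1000)]
orders = [
    {"id": f"{c['id']}-{j}", "customer_id": c["id"]}
    for c in customers
    for j in range(c["id"] % 4)
]

sample_customers, sample_orders = sample_with_foreign_keys(
    customers, orders, fk="customer_id", rate=0.1
)

original_dist = fanout_distribution(customers, orders, "customer_id")
sample_dist = fanout_distribution(sample_customers, sample_orders, "customer_id")

print("original fan-out distribution:", original_dist)
print("sample fan-out distribution:  ", sample_dist)
```

Comparing the two printed distributions gives a rough sense of how representative a purely random sample is with respect to one inter-table dependency. A distribution-preserving method would presumably choose rows so that this gap stays small, but how CoDS does that is not stated in the abstract.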