Data generation using declarative constraints

Authors:
Arvind Arasu; Raghav Kaushik; Jian Li
Affiliations:
Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA; University of Maryland, College Park, MD, USA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

We study the problem of generating synthetic databases having declaratively specified characteristics. This problem is motivated by database system and application testing, data masking, and benchmarking. While the data generation problem has been studied before, prior approaches are either non-declarative or have fundamental limitations relating to the data characteristics that they can capture and efficiently support. We argue that a natural, expressive, and declarative mechanism for specifying data characteristics is through cardinality constraints; a cardinality constraint specifies that the output of a query over the generated database has a certain cardinality. While the data generation problem is intractable in general, we present efficient algorithms that can handle a large and useful class of constraints. We include a thorough empirical evaluation illustrating that our algorithms handle complex constraints, scale well as the number of constraints increases, and outperform applicable prior techniques.
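
To make the notion of a cardinality constraint concrete, the following minimal sketch checks whether a small database satisfies a set of (query, expected row count) pairs. It is an illustration only and not the paper's generation algorithm: the `customer` table, its columns, the sample rows, and the helper `satisfies` are all hypothetical, and SQLite is used purely for convenience.

```python
import sqlite3


def satisfies(conn, constraints):
    """Return True if every (query, expected_count) pair holds on conn.

    Each constraint says: the output of `query` must contain exactly
    `expected_count` rows. (Hypothetical helper, for illustration.)
    """
    for query, expected in constraints:
        (actual,) = conn.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()
        if actual != expected:
            return False
    return True


# A tiny hand-written instance standing in for a generated database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, age INTEGER, state TEXT)"
)
rows = [(1, 25, "WA"), (2, 34, "WA"), (3, 41, "MD"), (4, 67, "MD")]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

# Declarative specification: queries paired with required cardinalities.
constraints = [
    ("SELECT * FROM customer", 4),
    ("SELECT * FROM customer WHERE age < 40", 2),
    ("SELECT * FROM customer WHERE state = 'MD' AND age >= 40", 2),
]

result = satisfies(conn, constraints)
```

A data generator in this setting works in the opposite direction: given only the constraints, it must produce rows for which a check like `satisfies` succeeds.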