Generating example data for dataflow programs

Authors:
Christopher Olston;Shubham Chopra;Utkarsh Srivastava
Affiliations:
Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Bangalore, India;Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009



Abstract

While developing data-centric programs, users often run (portions of) their programs over real data to see how they behave and what the output looks like. Doing so makes it easier to formulate, understand, and compose programs correctly, compared with examination of program logic alone. For large input data sets, these experimental runs can be time-consuming and inefficient. Unfortunately, sampling the input data does not always work well, because selective operations such as filter and join can lead to empty results over sampled inputs, and unless certain indexes are present, there is no way to generate biased samples efficiently. Consequently, new methods are needed for generating example input data for data-centric programs. We focus on an important category of data-centric programs, dataflow programs, which are best illustrated by displaying the series of intermediate data tables that occur between each pair of operations. We introduce and study the problem of generating example intermediate data for dataflow programs, in a manner that illustrates the semantics of the operators while keeping the example data small. We identify two major obstacles that impede naive approaches, namely (1) highly selective operators and (2) noninvertible operators, and offer techniques for dealing with these obstacles. Our techniques perform well on real dataflow programs used at Yahoo! for web analytics.
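
The sampling problem mentioned in the abstract can be made concrete with a small sketch. This is only an illustration of why independently sampled inputs tend to produce an empty join; it is not the technique proposed in the paper. The `users` and `clicks` tables, the 1% sampling rate, and the helper functions `join`, `sample`, and `expected_sampled_join_size` are all assumptions introduced for this example.

```python
import random


def join(left, right):
    """Equi-join two lists of (key, value) rows on their key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def sample(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


def expected_sampled_join_size(n_matches, rate):
    """A matching pair survives only if both of its rows are sampled."""
    return n_matches * rate * rate


# Hypothetical inputs: every user has exactly one matching click.
rng = random.Random(0)
users = [(i, f"user{i}") for i in range(1000)]
clicks = [(i, f"url{i}") for i in range(1000)]

full = join(users, clicks)
sampled = join(sample(users, 0.01, rng), sample(clicks, 0.01, rng))

print("full join rows:", len(full))
print("sampled join rows:", len(sampled))
print("expected sampled rows:", expected_sampled_join_size(len(full), 0.01))
```

Under these assumptions, each input sample keeps about ten rows, but the expected number of joined rows is about 0.1, so most runs show an empty result after the join. A highly selective filter has the same effect: the sample has to be large before any row passes it.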