Tracing lineage beyond relational operators

Authors:
Mingwu Zhang;Xiangyu Zhang;Xiang Zhang;Sunil Prabhakar
Affiliations:
Purdue University, West Lafayette, Indiana;Purdue University, West Lafayette, Indiana;Purdue University, West Lafayette, Indiana;Purdue University, West Lafayette, Indiana
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Abstract

Tracing the lineage of data is an important requirement for establishing the quality and validity of data. Recently, the problem of data provenance has been increasingly addressed in database research. Earlier work has been limited to the lineage of data as it is manipulated using relational operations within an RDBMS. While this captures a very important aspect of scientific data processing, the existing work is incapable of handling the equally important, and prevalent, cases where the data is processed by non-relational operations. This is particularly common in scientific data where sophisticated processing is achieved by programs that are not part of a DBMS. The problem of tracking lineage when non-relational operators are used to process the data is particularly challenging since there is potentially no constraint on the nature of the processing. In this paper we propose a novel technique that overcomes this significant barrier and enables the tracing of lineage of data generated by an arbitrary function. Our technique works directly with the executable code of the function and does not require any high-level description of the function or even the source code. We establish the feasibility of our approach on a typical application and demonstrate that the technique is able to discern the correct lineage. Furthermore, it is shown that the method can help identify limitations in the function itself.