Information Processing Letters
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
An efficient relevant slicing method for debugging
ESEC/FSE-7 Proceedings of the 7th European software engineering conference held jointly with the 7th ACM SIGSOFT international symposium on Foundations of software engineering
Algorithms and Data Structures in VLSI Design
Algorithms and Data Structures in VLSI Design
Journal of Intelligent Information Systems
Supporting Fine-grained Data Lineage in a Database Visualization Environment
ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Chimera: AVirtual Data System for Representing, Querying, and Automating Data Derivation
SSDBM '02 Proceedings of the 14th International Conference on Scientific and Statistical Database Management
Geo-Opera: Workflow Concepts for Spatial Processes
SSD '97 Proceedings of the 5th International Symposium on Advances in Spatial Databases
Dynamic Slicing Method for Maintenance of Large C Programs
CSMR '01 Proceedings of the Fifth European Conference on Software Maintenance and Reengineering
Lineage Tracing in a Data Warehousing System
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Efficient Forward Computation of Dynamic Slices Using Reduced Ordered Binary Decision Diagrams
Proceedings of the 26th International Conference on Software Engineering
Lineage retrieval for scientific data processing: a survey
ACM Computing Surveys (CSUR)
Experimental evaluation of using dynamic slices for fault location
Proceedings of the sixth international symposium on Automated analysis-driven debugging
The virtual data grid: a new model and architecture for data-intensive collaboration
SSDBM '03 Proceedings of the 15th International Conference on Scientific and Statistical Database Management
An annotation management system for relational databases
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Recording and using provenance in a protein compressibility experiment
HPDC '05 Proceedings of the High Performance Distributed Computing, 2005. HPDC-14. Proceedings. 14th IEEE International Symposium
Deriving input syntactic structure from execution
Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Understanding provenance black boxes
Distributed and Parallel Databases
Strict control dependence and its effect on dynamic information flow analyses
Proceedings of the 19th international symposium on Software testing and analysis
TAPP'10 Proceedings of the 2nd conference on Theory and practice of provenance
Coalescing executions for fast uncertainty analysis
Proceedings of the 33rd International Conference on Software Engineering
Towards automated collection of application-level data provenance
TaPP'12 Proceedings of the 4th USENIX conference on Theory and Practice of Provenance
White box sampling in uncertain data processing enabled by program analysis
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Hi-index | 0.00 |
Tracing the lineage of data is an important requirement for establishing the quality and validity of data. Recently, the problem of data provenance has been increasingly addressed in database research. Earlier work has been limited to the lineage of data as it is manipulated using relational operations within an RDBMS. While this captures a very important aspect of scientific data processing, the existing work is incapable of handling the equally important, and prevalent, cases where the data is processed by non-relational operations. This is particularly common in scientific data where sophisticated processing is achieved by programs that are not part of a DBMS. The problem of tracking lineage when non-relational operators are used to process the data is particularly challenging since there is potentially no constraint on the nature of the processing. In this paper we propose a novel technique that overcomes this significant barrier and enables the tracing of lineage of data generated by an arbitrary function. Our technique works directly with the executable code of the function and does not require any high-level description of the function or even the source code. We establish the feasibility of our approach on a typical application and demonstrate that the technique is able to discern the correct lineage. Furthermore, it is shown that the method can help identify limitations in the function itself.