Lineage tracing for general data warehouse transformations
The VLDB Journal — The International Journal on Very Large Data Bases
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Provenance in Databases: Why, How, and Where
Foundations and Trends in Databases
Efficient querying and maintenance of network provenance at internet-scale
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Hyracks: A flexible and extensible foundation for data-intensive computing
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Provenance-based refresh in data-oriented workflows
Proceedings of the 20th ACM international conference on Information and knowledge management
PowerGraph: distributed graph-parallel computation on natural graphs
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Hi-index | 0.00 |
A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation allows system developers to collect this fine-grain lineage from a range of data intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that while active collection can be expensive, it incurs modest runtime overheads for real-world analytics (