Scalable lineage capture for debugging DISC analytics

Authors:
Dionysios Logothetis;Soumyarupa De;Kenneth Yocum
Affiliations:
Telefonica Research;Microsoft, Inc.;U.C. San Diego and Illumina, Inc.
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Abstract

A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation allows system developers to collect this fine-grained lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that while active collection can be expensive, it incurs modest runtime overheads for real-world analytics (