Nova: continuous Pig/Hadoop workflows

Authors:
Christopher Olston;Greg Chiou;Laukik Chitnis;Francis Liu;Yiping Han;Mattias Larsson;Andreas Neumann;Vellanki B.N. Rao;Vijayanand Sankarasubramanian;Siddharth Seth;Chao Tian;Topher ZiCornell;Xiaodan Wang
Affiliations:
Yahoo! Research, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA;Johns Hopkins University, Baltimore, MD, USA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

This paper describes a workflow manager developed and deployed at Yahoo called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.) Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo's data processing use-cases, which deal with continually-arriving data and benefit from incremental algorithms, but do not require ultra-low-latency processing.
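The idea of batched, stateful incremental processing mentioned in the abstract can be illustrated with a minimal sketch. This is not Nova's interface and does not use Pig or Hadoop; the function names `process_batch` and `recompute_from_scratch`, the in-memory `Counter` state, and the sample batches are all assumptions made for illustration. It shows only that folding each newly arrived batch into saved state yields the same result as reprocessing all of the data.

```python
from collections import Counter


def process_batch(state, batch):
    """Fold one batch of newly arrived records into a copy of the saved state.

    Hypothetical helper: the state here is a per-key count held in memory,
    whereas a real batch system would persist it on disk between runs.
    """
    new_state = Counter(state)
    new_state.update(batch)
    return new_state


def recompute_from_scratch(batches):
    """Non-incremental baseline: reprocess every batch seen so far."""
    total = Counter()
    for batch in batches:
        total.update(batch)
    return total


# Illustrative input: three batches arriving over time.
batches = [["a", "b", "a"], ["b", "c"], ["a"]]

state = Counter()
for batch in batches:
    # Only the new batch is read; earlier data is represented by the state.
    state = process_batch(state, batch)

# The incremental result matches a full recomputation.
assert state == recompute_from_scratch(batches)
```

In this sketch each step touches only the latest batch plus the saved state, which is the property that makes incremental algorithms attractive when data arrives continually but results are not needed at very low latency.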