ACM Transactions on Database Systems (TODS)
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: A Distributed Storage System for Structured Data
ACM Transactions on Computer Systems (TOCS)
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Provenance in Databases: Why, How, and Where
Foundations and Trends in Databases
Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Stateful bulk processing for incremental analytics
Proceedings of the 1st ACM symposium on Cloud computing
Comet: batched stream processing for data intensive distributed computing
Proceedings of the 1st ACM symposium on Cloud computing
DryadInc: reusing work in large-scale computations
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Large-scale incremental processing using distributed transactions and notifications
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Managing rapidly-evolving scientific workflows
IPAW'06 Proceedings of the 2006 international conference on Provenance and Annotation of Data
Incoop: MapReduce for incremental computations
Proceedings of the 2nd ACM Symposium on Cloud Computing
CoScan: cooperative scan sharing in the cloud
Proceedings of the 2nd ACM Symposium on Cloud Computing
Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStreaming
Large-scale incremental data processing with change propagation
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
The HaLoop approach to large-scale iterative data analysis
The VLDB Journal — The International Journal on Very Large Data Bases
Shredder: GPU-accelerated incremental storage and computation
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
The seven deadly sins of cloud computing research
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Muppet: MapReduce-style processing of fast data
Proceedings of the VLDB Endowment
Spotting code optimizations in data-parallel pipelines through PeriSCOPE
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Facilitating real-time graph mining
Proceedings of the fourth international workshop on Cloud data management
Streaming big data with self-adjusting computation
DDFP '13 Proceedings of the 2013 workshop on Data driven functional programming
Journal of Computer and System Sciences
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Exploiting application dynamism and cloud elasticity for continuous dataflows
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.01 |
This paper describes a workflow manager developed and deployed at Yahoo called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.) Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo's data processing use-cases, which deal with continually-arriving data and benefit from incremental algorithms, but do not require ultra-low-latency processing.