Stateful bulk processing for incremental analytics

Authors:
Dionysios Logothetis;Christopher Olston;Benjamin Reed;Kevin C. Webb;Ken Yocum
Affiliations:
UCSD, La Jolla, CA, USA;Yahoo! Research, Santa Clara, USA;Yahoo! Research, Santa Clara, USA;UCSD, La Jolla, USA;UCSD, La Jolla, USA
Venue:
Proceedings of the 1st ACM symposium on Cloud computing
Year:
2010

Citing 19
Cited 30

Efficiently updating materialized views

SIGMOD '86 Proceedings of the 1986 ACM SIGMOD international conference on Management of data
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Incremental Recomputation of Active Relational Expressions

IEEE Transactions on Knowledge and Data Engineering
Incremental Clustering for Mining in a Data Warehousing Environment

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
The webgraph framework I: compression techniques

Proceedings of the 13th international conference on World Wide Web
SINA: scalable incremental processing of continuous queries in spatio-temporal databases

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
User interactions in social networks and their implications

Proceedings of the 4th ACM European conference on Computer systems
Pregel: a system for large-scale graph processing - "ABSTRACT"

Proceedings of the 28th ACM symposium on Principles of distributed computing
DryadInc: reusing work in large-scale computations

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Large-scale incremental processing using distributed transactions and notifications

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center

Proceedings of the 8th USENIX conference on Networked systems design and implementation
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nova: continuous Pig/Hadoop workflows

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
A paradigm comparison for collecting TV channel statistics from high-volume channel zap events

Proceedings of the 5th ACM international conference on Distributed event-based system
Mining large distributed log data in near real time

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Incoop: MapReduce for incremental computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Making time-stepped applications tick in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Incremental recomputations in MapReduce

Proceedings of the third international workshop on Cloud data management
Geostreaming in cloud

Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStreaming
Kineograph: taking the pulse of a fast-changing and connected world

Proceedings of the 7th ACM european conference on Computer Systems
Large-scale incremental data processing with change propagation

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
iMapReduce: A Distributed Computing Framework for Iterative Computation

Journal of Grid Computing
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Towards effective partition management for large graphs

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment
Facilitating real-time graph mining

Proceedings of the fourth international workshop on Cloud data management
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
Streaming big data with self-adjusting computation

DDFP '13 Proceedings of the 2013 workshop on Data driven functional programming
Large-scale computation not at the cost of expressiveness

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
i2MapReduce: incremental iterative MapReduce

Proceedings of the 2nd International Workshop on Cloud Intelligence
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Journal of Grid Computing
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Discretized streams: fault-tolerant streaming computation at scale

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
MillWheel: fault-tolerant stream processing at internet scale

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

This work addresses the need for stateful dataflow programs that can rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows, such as weekly web crawls, daily image/video uploads, log files, and growing social networks. While programmers may simply re-run the entire dataflow when new data arrives, this is grossly inefficient, increasing result latency and squandering hardware resources and energy. Alternatively, programmers may use prior results to incrementally incorporate the changes. However, current large-scale data processing tools, such as Map-Reduce or Dryad, limit how programmers incorporate and use state in data-parallel programs. Straightforward approaches to incorporating state can result in custom, fragile code and disappointing performance. This work presents a generalized architecture for continuous bulk processing (CBP) that raises the level of abstraction for building incremental applications. At its core is a flexible, groupwise processing operator that takes state as an explicit input. Unifying stateful programming with a data-parallel operator affords several fundamental opportunities for minimizing the movement of data in the underlying processing system. As case studies, we show how one can use a small set of flexible dataflow primitives to perform web analytics and mine large-scale, evolving graphs in an incremental fashion. Experiments with our prototype using real-world data indicate significant data movement and running time reductions relative to current practice. For example, incrementally computing PageRank using CBP can reduce data movement by 46% and cut running time in half.