Incoop: MapReduce for incremental computations

Authors:
Pramod Bhatotia; Alexander Wieder; Rodrigo Rodrigues; Umut A. Acar; Rafael Pasquin
Affiliations:
Max Planck Institute for Software Systems (MPI-SWS); Max Planck Institute for Software Systems (MPI-SWS); Max Planck Institute for Software Systems (MPI-SWS); Max Planck Institute for Software Systems (MPI-SWS); Universidade Federal de Uberlândia (FACOM/UFU)
Venue:
Proceedings of the 2nd ACM Symposium on Cloud Computing
Year:
2011


Abstract

Many online data sets evolve over time as new entries are slowly added and existing entries are deleted or modified. Taking advantage of this, systems for incremental bulk data processing, such as Google's Percolator, can achieve efficient updates. To achieve this efficiency, however, these systems lose compatibility with the simple programming models offered by non-incremental systems, e.g., MapReduce, and, more importantly, require the programmer to implement application-specific dynamic algorithms, ultimately increasing algorithm and code complexity. In this paper, we describe the architecture, implementation, and evaluation of Incoop, a generic MapReduce framework for incremental computations. Incoop detects changes to the input and automatically updates the output by employing an efficient, fine-grained result reuse mechanism. To achieve efficiency without sacrificing transparency, we adopt recent advances in the area of programming languages to identify the shortcomings of task-level memoization approaches, and we address these shortcomings with several novel techniques: a storage system, a contraction phase for Reduce tasks, and an affinity-based scheduling algorithm. We have implemented Incoop by extending the Hadoop framework, and evaluated it by considering several applications and case studies. Our results show significant performance improvements without changing a single line of application code.
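The task-level memoization that the abstract takes as its starting point can be illustrated with a minimal single-process sketch. This is not Incoop's implementation (Incoop extends Hadoop, and its storage system, contraction phase, and scheduler are not modeled here). The class name `MemoizingWordCount`, its fields, and the choice of word count as the application are assumptions made purely for illustration. The sketch caches each Map task's output under a hash of its input split, so a rerun over slightly changed input only re-executes the Map tasks whose splits changed.

```python
import hashlib
from collections import Counter


class MemoizingWordCount:
    """Toy stand-in for task-level memoization (hypothetical, not Incoop's API)."""

    def __init__(self):
        self.map_cache = {}  # content hash of a split -> memoized Map output
        self.map_runs = 0    # number of Map tasks actually executed

    def _map(self, split):
        # Map task: count the words in one input split.
        self.map_runs += 1
        return Counter(split.split())

    def run(self, splits):
        total = Counter()
        for split in splits:
            key = hashlib.sha256(split.encode("utf-8")).hexdigest()
            if key not in self.map_cache:
                # Split content is new or changed: execute the Map task.
                self.map_cache[key] = self._map(split)
            # Reduce step: sum per-split counts. In this sketch the Reduce
            # work is always redone; only Map results are reused.
            total.update(self.map_cache[key])
        return dict(total)


job = MemoizingWordCount()
job.run(["a b a", "c d", "e"])    # initial run executes three Map tasks
job.run(["a b a", "c d d", "e"])  # only the modified split is re-mapped
```

Note that the Reduce side in this sketch is recomputed in full on every run, which is one kind of shortcoming of plain task-level memoization; the abstract's contraction phase for Reduce tasks targets that part of the job.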