TransMR: data-centric programming beyond data parallelism

Authors:
Naresh Rapolu; Karthik Kambatla; Suresh Jagannathan; Ananth Grama
Affiliations:
Dept. of Computer Science, Purdue University; Dept. of Computer Science, Purdue University; Dept. of Computer Science, Purdue University; Dept. of Computer Science, Purdue University
Venue:
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Year:
2011

Abstract

MapReduce and related data-centric programming models have proven effective for a variety of large-scale distributed computations, particularly those that exhibit data parallelism. The fault-tolerance model underlying these programming environments relies on deterministic replay, which makes data sharing (side effects) across computations harder to support. This significantly limits the application scope of MapReduce and related models. This paper: (i) investigates data sharing (side effects) in programming models operating on distributed key-value stores, specifically the inconsistencies between the fault-recovery mechanisms in the execution and storage layers; (ii) defines semantics for a novel programming model, TransMR (Transactional MapReduce), which addresses these inconsistencies; and (iii) demonstrates, using a prototype implementation of the proposed semantics, the broad application scope and enhanced performance enabled by data sharing across computations.
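
The tension between replay-based recovery and side effects can be illustrated with a minimal sketch. This is not the TransMR implementation, and the abstract does not specify the paper's semantics. The names `KVStore`, `unsafe_task`, `transactional_task`, and `run_with_replay` are hypothetical, and the store is a single-process, in-memory stand-in for a distributed key-value store. The sketch shows that re-executing a failed task repeats writes that already reached the store, whereas buffering writes and applying them only on completion makes replay safe.

```python
class KVStore:
    """Hypothetical in-memory stand-in for a distributed key-value store."""

    def __init__(self):
        self.data = {}

    def get(self, key, default=0):
        return self.data.get(key, default)

    def put(self, key, value):
        self.data[key] = value


class TaskFailure(Exception):
    """Simulated crash of a task partway through its work."""


def unsafe_task(store, keys, fail_after=None):
    # Writes go straight to the store, so a crash leaves partial side effects.
    for i, key in enumerate(keys):
        if fail_after is not None and i == fail_after:
            raise TaskFailure()
        store.put(key, store.get(key) + 1)


def transactional_task(store, keys, fail_after=None):
    # Writes are buffered locally and applied only if the task completes.
    buffered = {}
    for i, key in enumerate(keys):
        if fail_after is not None and i == fail_after:
            raise TaskFailure()
        buffered[key] = buffered.get(key, store.get(key)) + 1
    # "Commit" step: atomic only because this sketch is single-threaded.
    for key, value in buffered.items():
        store.put(key, value)


def run_with_replay(task, store, keys):
    # Execution-layer recovery: on failure, simply re-execute the task.
    try:
        task(store, keys, fail_after=1)
    except TaskFailure:
        task(store, keys)


def demo():
    keys = ["a", "b"]
    unsafe, safe = KVStore(), KVStore()
    run_with_replay(unsafe_task, unsafe, keys)
    run_with_replay(transactional_task, safe, keys)
    return unsafe.data, safe.data
```

Calling `demo()` returns the final contents of both stores. In the unsafe variant, key `"a"` is incremented twice because the first attempt's write survived the crash. In the buffered variant, each key is incremented exactly once. A real system would also need concurrency control and an atomic commit across machines, which this sketch leaves out.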