Programming Abstractions for Data Intensive Computing on Clouds and Grids

Authors:
Chris Miceli;Michael Miceli;Shantenu Jha;Hartmut Kaiser;Andre Merzky
Venue:
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Year:
2009

Abstract

MapReduce has emerged as an important data-parallel programming model for data-intensive computing on both Clouds and Grids. However, most if not all implementations of MapReduce are coupled to a specific infrastructure. SAGA is a high-level programming interface that provides the ability to create distributed applications in an infrastructure-independent way. In this paper, we show how MapReduce has been implemented using SAGA and demonstrate its interoperability across different distributed platforms: Grids, Cloud-like infrastructure, and Clouds. We discuss the advantages of programmatically developing MapReduce using SAGA by demonstrating that the SAGA-based implementation is infrastructure independent whilst still providing control over deployment, distribution, and runtime decomposition. The ability to control the distribution and placement of the computation units (workers) is critical to moving computational work to the data. This is required to keep network data transfer low and, in the case of commercial Clouds, to keep the monetary cost of computing the solution low. Using data sets of up to 10 GB and up to 10 workers, we provide a detailed performance analysis of the SAGA-MapReduce implementation and show how controlling the distribution of computation and the payload per worker helps enhance performance.
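The two ideas in the abstract, keeping the MapReduce logic separate from the infrastructure that runs it and placing workers near their data, can be illustrated with a small sketch. It is not the SAGA API and not the paper's implementation. The `Backend` type, the two backends, and `place_chunks` are hypothetical names introduced only for this illustration, and word counting is an assumed example workload.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an infrastructure adaptor: anything that can
# apply a function to a list of inputs. This is NOT the SAGA API.
Backend = Callable[[Callable, List], List]


def serial_backend() -> Backend:
    """Run every task in the calling thread."""
    return lambda fn, items: [fn(item) for item in items]


def thread_backend(num_workers: int) -> Backend:
    """Run tasks on a local thread pool (a toy 'other infrastructure')."""
    def run(fn, items):
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            return list(pool.map(fn, items))
    return run


def map_words(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(item: Tuple[str, List[int]]) -> Tuple[str, int]:
    """Reduce step: sum the values collected for one key."""
    key, values = item
    return key, sum(values)


def map_reduce(chunks: List[str], backend: Backend) -> Dict[str, int]:
    """The same map/shuffle/reduce logic runs unchanged on any backend."""
    mapped = backend(map_words, chunks)
    groups = defaultdict(list)          # shuffle: group values by key
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(backend(reduce_counts, sorted(groups.items())))


def place_chunks(chunk_sites: Dict[str, str],
                 worker_sites: Dict[str, str]) -> Dict[str, str]:
    """Assign each chunk to a worker at the chunk's own site when one
    exists, otherwise fall back to round-robin over all workers."""
    by_site = defaultdict(list)
    for worker, site in worker_sites.items():
        by_site[site].append(worker)
    all_workers = list(worker_sites)
    placement = {}
    for i, (chunk, site) in enumerate(chunk_sites.items()):
        candidates = by_site.get(site) or all_workers
        placement[chunk] = candidates[i % len(candidates)]
    return placement


if __name__ == "__main__":
    chunks = ["the cat sat", "The dog sat"]
    print(map_reduce(chunks, serial_backend()))
    print(map_reduce(chunks, thread_backend(2)))
    print(place_chunks({"c0": "grid", "c1": "cloud"},
                       {"w0": "grid", "w1": "cloud"}))
```

Swapping `serial_backend()` for `thread_backend(2)` changes where the tasks run without touching `map_reduce`, which is the kind of separation the abstract attributes to the SAGA-based implementation. `place_chunks` shows only the simplest co-location policy, and a real deployment would also weigh transfer cost and worker load.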