MATE-EC2: a middleware for processing data with AWS

Authors:
Tekin Bicer;David Chiu;Gagan Agrawal
Affiliations:
Ohio State University, Columbus, OH, USA;Washington State University, Vancouver, WA, USA;Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
Year:
2011


Abstract

Recently, there has been growing interest in using Cloud resources for a variety of high-performance and data-intensive applications. While there are currently a number of commercial Cloud service providers, Amazon Web Services (AWS) appears to be the most widely used. One of the main services that AWS offers is the Simple Storage Service (S3) for unbounded, reliable storage of data, which is particularly well suited to data-intensive processes. These applications need support for effective retrieval and processing of data stored in S3. In this paper, we focus on parallel and scalable processing of data stored in S3 using compute instances in AWS. We describe a middleware that allows data processing to be specified through a high-level API, which is a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job assignment, and data retrieval strategies, that can be leveraged based on the performance characteristics of S3. Our middleware is also capable of effectively using a heterogeneous collection of EC2 instances for data processing. Our detailed experimental study further evaluates which factors affect the efficiency of retrieving and processing S3 data. We compare our middleware with Amazon Elastic Map-Reduce and show how we determine the best configuration for data processing on AWS.
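
As a rough illustration of the processing style the abstract describes, the sketch below splits a dataset into chunks, lets a pool of workers fold each chunk into a local accumulator, and then merges the partial results. It is not the MATE-EC2 API: the function names (`split_into_chunks`, `local_reduction`, `global_combination`, `run_job`) are hypothetical, the in-memory list stands in for data stored in S3, the thread pool stands in for EC2 instances, and the accumulate-then-merge structure is only an assumption about what a Map-Reduce variant might look like.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Split a sequence into fixed-size chunks (stand-ins for pieces of stored objects)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def local_reduction(chunk):
    """Fold every element of one chunk directly into a local accumulator."""
    acc = Counter()
    for word in chunk:
        acc[word] += 1
    return acc


def global_combination(partials):
    """Merge the per-chunk accumulators into one final result."""
    total = Counter()
    for acc in partials:
        total.update(acc)
    return total


def run_job(data, chunk_size=3, workers=2):
    """Assign chunks to a pool of workers, then combine their partial results."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_reduction, chunks))
    return global_combination(partials)


# Hypothetical input: a tiny word list in place of a real dataset.
words = ["s3", "ec2", "s3", "aws", "ec2", "s3", "aws"]
result = run_job(words)
print(dict(result))
```

Chunk size and worker count are the kind of knobs that the data organization and job assignment optimizations mentioned in the abstract would tune; the values used here are arbitrary.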