MATE-EC2: a middleware for processing data with AWS

  • Authors:
  • Tekin Bicer;David Chiu;Gagan Agrawal

  • Affiliations:
  • Ohio State University, Columbus, OH, USA;Washington State University, Vancouver, WA, USA;Ohio State University, Columbus, OH, USA

  • Venue:
  • Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
  • Year:
  • 2011

Abstract

Recently, there has been growing interest in using Cloud resources for a variety of high-performance and data-intensive applications. While there are currently a number of commercial Cloud service providers, Amazon Web Services (AWS) appears to be the most widely used. One of the main services that AWS offers is the Simple Storage Service (S3) for unbounded, reliable storage of data, which is particularly well suited to data-intensive processes. These applications need support for effective retrieval and processing of data stored in S3. In this paper, we focus on parallel and scalable processing of data stored in S3 using compute instances in AWS. We describe a middleware that allows data processing to be specified through a high-level API, which is a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job assignment, and data retrieval strategies, that can be leveraged based on the performance characteristics of S3. Our middleware is also capable of effectively using a heterogeneous collection of EC2 instances for data processing. Our detailed experimental study further evaluates which factors affect the efficiency of retrieving and processing S3 data. We compare our middleware with Amazon Elastic Map-Reduce and show how we determine the best configuration for data processing on AWS.
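
To make the idea of job assignment across workers of differing speed more concrete, here is a minimal sketch. It is illustrative only and does not reproduce the paper's middleware or API. It assumes a pull-based scheme in which each worker takes the next data chunk from a shared pool whenever it becomes free, so faster workers naturally process more chunks. The function name `process_chunks`, the word-count workload, and the use of in-memory strings in place of S3 objects are all assumptions made for this example; no AWS calls are made.

```python
import queue
import threading
from collections import Counter


def process_chunks(chunks, num_workers=3):
    """Count words across chunks using pull-based job assignment.

    Hypothetical illustration: `chunks` are plain strings standing in for
    data objects that would be retrieved from remote storage.
    """
    jobs = queue.Queue()
    for chunk in chunks:
        jobs.put(chunk)

    # One private partial result per worker, so no locking is needed
    # while chunks are being processed.
    partials = [Counter() for _ in range(num_workers)]

    def worker(idx):
        local = partials[idx]
        while True:
            try:
                # A free worker pulls the next job; slow workers simply
                # pull less often, which balances the load.
                chunk = jobs.get_nowait()
            except queue.Empty:
                return
            for word in chunk.split():
                local[word] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the per-worker partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because each worker accumulates into its own partial result and the partials are merged only at the end, the final counts do not depend on which worker happened to process which chunk.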