Reshaping text data for efficient processing on Amazon EC2

Authors:
Gabriela Turcu;Ian Foster;Svetlozar Nestorov
Affiliations:
University of Chicago, Chicago, Illinois;University of Chicago, Chicago, Illinois;Computation Institute, Chicago, Illinois
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Abstract

Text analysis tools must now process increasingly large corpora, which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user's perspective, aiming to provide a scheduling strategy that is both timely and cost-effective. We rely on the empirical performance of the application of interest on smaller subsets of the data to construct an execution plan. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. Because the output is then less fragmented, this also speeds up retrieval of the application's results. Using predictions of the application's performance based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
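
The file-merging step described in the abstract can be illustrated with a minimal sketch. It assumes a plain first-fit pass over the files in input order, which may differ from the authors' exact subset-sum first-fit variant. The function name `first_fit_merge`, the `(name, size)` input format, and the example sizes are hypothetical and chosen only for illustration.

```python
def first_fit_merge(file_sizes, target_size):
    """Group files so that each group's total size stays at or below target_size.

    file_sizes: list of (name, size) pairs (hypothetical input format).
    target_size: desired merged-file size, in the same unit as the sizes.
    Returns a list of groups, each a list of file names to merge together.
    """
    bins = []  # each entry is [current_total, [file names]]
    for name, size in file_sizes:
        for b in bins:
            # First fit: place the file in the first group that still has room.
            if b[0] + size <= target_size:
                b[0] += size
                b[1].append(name)
                break
        else:
            # No existing group can take the file, so open a new one.
            # A file larger than target_size ends up alone in its own group.
            bins.append([size, [name]])
    return [names for _, names in bins]


# Example with made-up sizes and a target of 100 units.
groups = first_fit_merge([("a", 40), ("b", 70), ("c", 50), ("d", 10)], 100)
print(groups)  # [['a', 'c', 'd'], ['b']]
```

In this sketch each group would then be concatenated into one input file for the application, so that the number of files processed (and output files retrieved) drops while each merged file stays close to the target size.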