Cloud versus in-house cluster: evaluating Amazon cluster compute instances for running MPI applications

  • Authors:
  • Yan Zhai;Mingliang Liu;Jidong Zhai;Xiaosong Ma;Wenguang Chen

  • Affiliations:
  • Tsinghua university;Tsinghua university;Tsinghua university;North Carolina State University;Tsinghua university

  • Venue:
  • State of the Practice Reports
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

The emergence of cloud services brings new possibilities for constructing and using HPC platforms. However, while cloud services provide the flexibility and convenience of customized, pay-as-you-go parallel computing, multiple previous studies in the past three years have indicated that cloud-based clusters need a significant performance boost to become a competitive choice, especially for tightly coupled parallel applications. In this work, we examine the feasibility of running HPC applications in clouds. This study distinguishes itself from existing investigations in several ways: 1) We carry out a comprehensive examination of issues relevant to the HPC community, including performance, cost, user experience, and range of user activities. 2) We compare an Amazon EC2-based platform built upon its newly available HPC-oriented virtual machines with typical local cluster and supercomputer options, using benchmarks and applications with scale and problem size unprecedented in previous cloud HPC studies. 3) We perform detailed performance and scalability analysis to locate the chief limiting factors of the state-of-the-art cloud based clusters. 4) We present a case study on the impact of per-application parallel I/O system configuration uniquely enabled by cloud services. Our results reveal that though the scalability of EC2-based virtual clusters still lags behind traditional HPC alternatives, they are rapidly gaining in overall performance and cost-effectiveness, making them feasible candidates for performing tightly coupled scientific computing. In addition, our detailed benchmarking and profiling discloses and analyzes several problems regarding the performance and performance stability on EC2.