Resource monitoring and management with OVIS to enable HPC in cloud computing environments

  • Authors:
  • Jim Brandt;Ann Gentile;Jackson Mayo;Philippe Pebay;Diana Roe;David Thompson;Matthew Wong

  • Affiliations:
  • Sandia National Laboratories, MS 9159, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9152, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9159, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9159, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9152, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9159, P.O. Box 969, Livermore, CA 94551 U.S.A.;Sandia National Laboratories, MS 9152, P.O. Box 969, Livermore, CA 94551 U.S.A.

  • Venue:
  • IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
  • Year:
  • 2009

Abstract

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model has the lure of savings in ownership, operation, and maintenance costs, and thus sounds like an attractive solution for people who currently invest millions to hundreds of millions of dollars annually in High Performance Computing (HPC) platforms in order to support large-scale scientific simulation codes. Given the current interconnect bandwidth and topologies utilized in these commercial offerings, however, the only current viable market in HPC would be small-memory-footprint embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, which are crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they bring the ability to dynamically provide and respond to information about the platform and application state and would enable more appropriate, efficient, and flexible use of the resources key to enabling HPC. Additionally, such tools could be of benefit to non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general.