STORM: Scalable Resource Management for Large-Scale Parallel Computers

  • Authors:
  • Eitan Frachtenberg;Fabrizio Petrini;Juan Fernandez;Scott Pakin

  • Affiliations:
  • -;-;-;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2006

Quantified Score

Hi-index 14.98

Visualization

Abstract

Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.