STORM: lightning-fast resource management
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.