The interaction of parallel and sequential workloads on a network of workstations
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Scheduling with implicit information in distributed systems
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
GLUnix: a global layer Unix for a network of workstations
Software—Practice & Experience - Special issue on multiprocessor operating systems
A closer look at coscheduling approaches for a network of workstations
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
An evaluation of parallel job scheduling for ASCI Blue-Pacific
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Using multicast to pre-load jobs on the ParPar cluster
Parallel Computing
Implicit coscheduling: coordinated scheduling with implicit information in distributed systems
ACM Transactions on Computer Systems (TOCS)
Highly efficient gang scheduling implementation
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
BProc: the Beowulf distributed process space
ICS '02 Proceedings of the 16th international conference on Supercomputing
Scalable parallel application launch on Cplant™
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
EMP: zero-copy OS-bypass NIC-driven gigabit ethernet message passing
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Efficient Multicast on Myrinet using Link-Level Flow Control
ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing
PM: An Operating System Coordinated High Performance Communication Library
HPCN Europe '97 Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking
Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages
CANPC '00 Proceedings of the 4th International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications
Packing Schemes for Gang Scheduling
IPPS '96 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Improved Utilization and Responsiveness with Gang Scheduling
IPPS '97 Proceedings of the Job Scheduling Strategies for Parallel Processing
Buffered Coscheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Gang Scheduling with Lightweight User-Level Communication
ICPPW '01 Proceedings of the 2001 International Conference on Parallel Processing Workshops
Collective communication patterns on the quadrics network
Performance analysis and grid computing
The Supercomputer Industry in Light of the Top500 Data
Computing in Science and Engineering
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Scalable Hardware-Based Multicast Trees
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Fast Scalable File Distribution Over Infiniband
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
On the Scalability of Centralized Control
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Adaptive Parallel Job Scheduling with Flexible Coscheduling
IEEE Transactions on Parallel and Distributed Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
STORM: Scalable Resource Management for Large-Scale Parallel Computers
IEEE Transactions on Computers
TakTuk, adaptive deployment of remote executions
Proceedings of the 18th ACM international symposium on High performance distributed computing
The Impact of noise on the scaling of collectives: the nearest neighbor model
HiPC'07 Proceedings of the 14th international conference on High performance computing
Adaptive connection management for scalable MPI over InfiniBand
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A multi-level scalable startup for parallel applications
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Service control with the preemptive parallel job scheduler Scojo-PECT
Cluster Computing
Impact of noise on scaling of collectives: an empirical evaluation
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
The impact of noise on the scaling of collectives: a theoretical approach
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Parallel job scheduling — a status report
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Fast and scalable startup of MPI programs in infiniband clusters
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
Hi-index | 0.00 |
Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem---essentially, all of the code that runs on a cluster other than the applications---increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.