STORM: lightning-fast resource management

Authors:
Eitan Frachtenberg;Fabrizio Petrini;Juan Fernandez;Scott Pakin;Salvador Coll
Affiliations:
Los Alamos National Laboratory;Los Alamos National Laboratory;Los Alamos National Laboratory;Los Alamos National Laboratory;Los Alamos National Laboratory
Venue:
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Year:
2002

Abstract

Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem (essentially, all of the code that runs on a cluster other than the applications) increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.