Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization

Authors:
James Patton Jones; Bill Nitzberg
Venue:
IPPS/SPDP '99/JSSPP '99 Proceedings of the Job Scheduling Strategies for Parallel Processing
Year:
1999


Abstract

The NAS facility has operated parallel supercomputers for the past 11 years, including the Intel iPSC/860, Intel Paragon, Thinking Machines CM-5, IBM SP-2, and Cray Origin 2000. Across this wide variety of machine architectures, across a span of 10 years, across a large number of different users, and through thousands of minor configuration and policy changes, the utilization of these machines shows three general trends: (1) scheduling using a naive FCFS first-fit policy results in 40-60% utilization, (2) switching to the more sophisticated dynamic backfilling scheduling algorithm improves utilization by about 15 percentage points (yielding about 70% utilization), and (3) reducing the maximum allowable job size further increases utilization. Most surprising is the consistency of these trends. Over the lifetime of the NAS parallel systems, we made hundreds, perhaps thousands, of small changes to hardware, software, and policy, yet utilization was affected little. In particular, these results show that the goal of achieving near 100% utilization while supporting a real parallel supercomputing workload is unrealistic.
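
To illustrate why backfilling tends to raise utilization over a strict first-come, first-served queue, the following is a minimal Python sketch under stated assumptions. It is not the NAS schedulers or the policies measured in the paper: the function name `simulate`, the toy job list, the 8-node machine, the assumption that all jobs are queued at time 0, and the EASY-style backfilling rule (a later job may start early only if it does not delay the blocked head job) are all choices made here for illustration. Runtimes are assumed to be known exactly, which real workloads do not guarantee.

```python
import heapq


def simulate(jobs, total_nodes, backfill):
    """Return machine utilization for a queue of (nodes, runtime) jobs.

    Hypothetical toy model: every job is already queued at time 0 and
    runtimes are exact. With backfill=False the queue is served in strict
    FCFS order; with backfill=True an EASY-style rule lets later jobs jump
    ahead when they cannot delay the blocked head job.
    """
    if any(n > total_nodes for n, _ in jobs):
        raise ValueError("a job requests more nodes than the machine has")

    queue = list(jobs)   # jobs in arrival (FCFS) order
    running = []         # min-heap of (end_time, nodes)
    free = total_nodes
    now = 0.0
    makespan = 0.0

    def start(job):
        nonlocal free, makespan
        n, t = job
        free -= n
        heapq.heappush(running, (now + t, n))
        makespan = max(makespan, now + t)

    while queue:
        # Start jobs from the head of the queue for as long as they fit.
        while queue and queue[0][0] <= free:
            start(queue.pop(0))
        if not queue:
            break

        if backfill:
            # Shadow time: earliest moment the blocked head job can start.
            head_nodes = queue[0][0]
            avail, shadow = free, now
            for end, n in sorted(running):
                avail += n
                shadow = end
                if avail >= head_nodes:
                    break
            extra = avail - head_nodes  # nodes the head leaves unused then

            i = 1
            while i < len(queue):
                n, t = queue[i]
                ends_in_time = now + t <= shadow
                if n <= free and (ends_in_time or n <= extra):
                    if not ends_in_time:
                        extra -= n  # job overlaps the head's start
                    start(queue.pop(i))
                else:
                    i += 1

        # Advance the clock to the next job completion.
        end, n = heapq.heappop(running)
        now = end
        free += n

    work = sum(n * t for n, t in jobs)
    return work / (total_nodes * makespan)


# Toy queue (illustrative values only, not data from the paper).
toy_jobs = [(6, 4), (8, 2), (2, 4), (2, 3)]
print("strict FCFS :", simulate(toy_jobs, 8, backfill=False))
print("backfilling :", simulate(toy_jobs, 8, backfill=True))
```

In this toy queue the full-machine job at the head of the queue blocks the small jobs behind it under strict FCFS, leaving two nodes idle; backfilling slips one small job into that gap and shortens the overall schedule. The numbers it prints describe only this toy queue and say nothing about the utilization figures reported in the abstract.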