This paper addresses job scheduling for parallel supercomputers. Modern parallel systems with n nodes can be used by jobs requesting up to n nodes. If fewer than n nodes are requested, multiple jobs can run at the same time, allowing several users to share the system. One of the challenges for the operating system is to give reasonable service to a diverse group of requests. A single 1-node job that runs for a long time may effectively block the whole machine if the next job requests all n nodes. To date, various policies have been proposed for scheduling highly parallel computers, but as users of current supercomputers know, these policies are far from perfect. This paper reports on measurements of the usage of a 96-node Intel Paragon at ETH Zurich, a 512-node IBM SP2 at Cornell Theory Center, and a 512-node Cray T3D at Pittsburgh Supercomputing Center. We discuss the common characteristics of the different workloads and identify their impact on job scheduling techniques for such parallel systems. The metrics used for evaluating scheduling are based on turnaround time and fairness among jobs. We specifically show how two simple scheduling optimizations based on reordering the waiting queue can be used to effectively improve scheduling performance on real workloads. An important contribution of this paper is to establish that supercomputer workloads exhibit some common characteristics but also differ in important ways, and that knowledge of the workload is important for the design of effective scheduling algorithms. Given the current ad hoc approach to tuning scheduler systems, these results are of interest to scheduling researchers, supercomputer installations, and developers of scheduling software.
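The blocking scenario described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's method: the abstract does not name its two queue-reordering optimizations, so the shortest-runtime-first reordering used here is only a stand-in, and the 8-node machine, the job list, and the function names `simulate` and `mean_turnaround` are all hypothetical. It assumes every job is submitted at time 0, that runtimes are known in advance, and that the scheduler starts jobs strictly in queue order with no backfilling.

```python
import heapq

def simulate(queue, total_nodes):
    """Start jobs strictly in queue order on a machine with total_nodes nodes.

    Each job is a (nodes, runtime) pair and all jobs are submitted at time 0,
    so a job's completion time equals its turnaround time.
    Returns the list of turnaround times in queue order.
    """
    free = total_nodes
    now = 0.0
    running = []      # min-heap of (finish_time, nodes_held)
    turnaround = []
    for nodes, runtime in queue:
        if nodes > total_nodes:
            raise ValueError("job requests more nodes than the machine has")
        # The head of the queue waits until enough nodes have been released.
        while free < nodes:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
        turnaround.append(now + runtime)
    return turnaround

def mean_turnaround(queue, total_nodes):
    times = simulate(queue, total_nodes)
    return sum(times) / len(times)

# Hypothetical workload: a long 1-node job is followed by a full-machine job.
TOTAL_NODES = 8
jobs = [(1, 100), (8, 10), (1, 5), (1, 5)]

# Submission order: the 1-node job holds up the 8-node job and everything behind it.
fcfs = mean_turnaround(jobs, TOTAL_NODES)

# Stand-in reordering (an assumption, not the paper's policy): shortest runtime first.
reordered = mean_turnaround(sorted(jobs, key=lambda job: job[1]), TOTAL_NODES)

print(fcfs, reordered)
```

In this toy workload the long 1-node job delays the full-machine job and, behind it, the two short jobs. Reordering the waiting queue lowers the mean turnaround time, but the long job now finishes later than it did in submission order, which is the kind of trade-off a fairness metric is meant to capture.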