This paper characterizes "queue storms" in supercomputer systems and discusses methods for quelling them. Queue storms are anomalously large queue lengths that depend on the job size mix, the queuing system, the machine size, and the correlations and dependencies between job submissions. We use synthetic data generated from actual job logs of the ASCI Blue Mountain supercomputer, combined with different long-range dependencies. We show the distribution of the time until the first storm occurs, which is in a sense the time at which the machine becomes obsolete, because it is when the machine first fails to provide satisfactory turnaround. Overcoming queue storms requires more resources, even if those resources appear superfluous most of the time. We present two methods, including a grid-based solution, for reducing these correlations and their resulting effect on the size and frequency of queue storms.
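
The idea of a "time to first storm" can be illustrated with a toy sketch. It is not the paper's model or data. It assumes a discrete-time queue with a fixed service capacity per step, and it treats a storm as the first step at which the backlog exceeds a chosen threshold. The function name `time_to_first_storm`, the parameters `capacity` and `threshold`, and both arrival sequences are hypothetical.

```python
from typing import Iterable, Optional


def time_to_first_storm(
    arrivals: Iterable[int], capacity: int, threshold: int
) -> Optional[int]:
    """Return the first step at which the backlog exceeds `threshold`.

    Assumption: each step, `capacity` jobs are served and the remainder
    carries over as backlog. Returns None if no storm occurs.
    """
    backlog = 0
    for step, arrived in enumerate(arrivals):
        # Jobs left waiting after this step's service; never negative.
        backlog = max(0, backlog + arrived - capacity)
        if backlog > threshold:
            return step
    return None


# Hypothetical arrival patterns with the same total work (24 jobs in 8 steps).
smooth = [3, 3, 3, 3, 3, 3, 3, 3]    # uncorrelated, evenly spread submissions
bursty = [0, 0, 12, 12, 0, 0, 0, 0]  # correlated submissions arriving together

print(time_to_first_storm(smooth, capacity=4, threshold=5))
print(time_to_first_storm(bursty, capacity=4, threshold=5))
print(time_to_first_storm(bursty, capacity=10, threshold=5))
```

In this toy setting the evenly spread load never produces a storm at capacity 4, while the bursty load does. Raising the capacity to 10 avoids the storm even though that capacity sits idle for most steps, which mirrors the abstract's point about resources that appear superfluous most of the time.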