There goes the neighborhood: performance degradation due to nearby jobs

Authors:
Abhinav Bhatele;Kathryn Mohror;Steven H. Langer;Katherine E. Isaacs
Affiliations:
Lawrence Livermore National Laboratory, Livermore, California;Lawrence Livermore National Laboratory, Livermore, California;Lawrence Livermore National Laboratory, Livermore, California;University of California, Davis, California
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

Predictable performance is important for understanding and alleviating application performance issues; quantifying the effects of source code, compiler, or system software changes; estimating the time required for batch jobs; and determining the allocation requests for proposals. Our experiments show that on a Cray XE system, the execution time of a communication-heavy parallel application ranges from 28% faster to 41% slower than the average observed performance. Blue Gene systems, on the other hand, demonstrate no noticeable run-to-run variability. In this paper, we focus on Cray machines and investigate potential causes for performance variability such as OS jitter, shape of the allocated partition, and interference from other jobs sharing the same network links. Reducing such variability could improve overall throughput at a computer center and save energy costs.