There goes the neighborhood: performance degradation due to nearby jobs

  • Authors: Abhinav Bhatele; Kathryn Mohror; Steven H. Langer; Katherine E. Isaacs

  • Affiliations: Lawrence Livermore National Laboratory, Livermore, California; Lawrence Livermore National Laboratory, Livermore, California; Lawrence Livermore National Laboratory, Livermore, California; University of California, Davis, California

  • Venue: SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • Year: 2013

Abstract

Predictable performance is important for understanding and alleviating application performance issues; quantifying the effects of source code, compiler, or system software changes; estimating the time required for batch jobs; and determining the allocation requests for proposals. Our experiments show that on a Cray XE system, the execution time of a communication-heavy parallel application ranges from 28% faster to 41% slower than the average observed performance. Blue Gene systems, on the other hand, demonstrate no noticeable run-to-run variability. In this paper, we focus on Cray machines and investigate potential causes for performance variability such as OS jitter, shape of the allocated partition, and interference from other jobs sharing the same network links. Reducing such variability could improve overall throughput at a computer center and save energy costs.
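
The "faster/slower than the average" figures in the abstract can be read as each run's execution time expressed as a percent deviation from the mean over all runs. The sketch below illustrates that arithmetic only. It assumes this reading of the metric, and the run times are made-up values chosen so that the mean is 100 s and the extremes echo the quoted range; they are not measurements from the paper, and `relative_deviation` is a hypothetical helper name.

```python
from statistics import mean


def relative_deviation(times):
    """Return each run's time as a percent deviation from the mean time.

    Negative values are runs faster than average, positive values slower.
    """
    avg = mean(times)
    return [100.0 * (t - avg) / avg for t in times]


# Hypothetical execution times in seconds (illustrative, not from the paper).
run_times = [72.0, 87.0, 100.0, 141.0]

devs = relative_deviation(run_times)
fastest, slowest = min(devs), max(devs)

print(f"fastest run: {-fastest:.0f}% below the mean time")
print(f"slowest run: {slowest:.0f}% above the mean time")
```

A different convention, such as deviation in speed rather than in time, would give different percentages for the same runs, so the definition should be fixed before comparing variability across systems.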