The design and evaluation of high-performance computers has concentrated on increasing computational speed for applications, and this performance is often measured on a well-configured, dedicated system to show the best case. In real environments, resources are rarely dedicated to a single task; systems run jobs that influence one another, so run times vary, sometimes to an unreasonably large extent. This paper systematically explores the amount of run-time variation observed across four large distributed-memory systems, analyzes the causes of that variation, and discusses what can be done to reduce it without sacrificing performance.
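As a minimal illustration of how run-to-run variation can be quantified (a sketch, not the paper's actual methodology), the Python snippet below repeatedly times a command and reports the mean, standard deviation, and coefficient of variation of its wall-clock run times. The command name and repetition count are placeholders.

import statistics
import subprocess
import time

# Hypothetical benchmark command; replace with the application under study.
COMMAND = ["./my_benchmark"]
RUNS = 20  # number of repeated runs (an assumed value, not from the paper)

def time_command(cmd):
    """Run cmd once and return its wall-clock run time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

times = [time_command(COMMAND) for _ in range(RUNS)]

mean = statistics.mean(times)
stdev = statistics.stdev(times)
# Coefficient of variation: run-to-run variability relative to the mean.
cv = stdev / mean

print(f"mean={mean:.3f}s stdev={stdev:.3f}s cv={cv:.2%}")
print(f"min={min(times):.3f}s max={max(times):.3f}s "
      f"spread={(max(times) - min(times)) / min(times):.2%}")

On a dedicated system the coefficient of variation is typically small; on a shared system, interference from nearby jobs tends to inflate both the spread and the tail of the run-time distribution.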