Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how its impact changes with application scale. For example, our results show that 2.5% net processor noise at 10,000 nodes can have no impact or can result in over a factor of 20 slowdown for the same application, depending solely on how the noise is generated. We also discuss how characteristics of the applications we studied, such as computation/communication ratios and collective communication sizes, relate to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings for the design of new operating systems, middleware, and other system services for high-end parallel systems.
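To make the frequency/duration distinction concrete, the following is a minimal sketch of a first-order model, not the kernel-level injection mechanism used in the paper. It assumes a bulk-synchronous application in which every node computes for a fixed interval and then waits at a barrier, and periodic noise with a uniformly random phase on each node. The function names, the example noise patterns, and the 1 ms compute interval are all illustrative assumptions, and the model is not meant to reproduce the measured slowdowns.

```python
def net_noise_fraction(freq_hz, duration_s):
    """Fraction of processor time consumed by periodic noise."""
    return freq_hz * duration_s


def estimated_slowdown(freq_hz, duration_s, nodes, compute_s):
    """First-order slowdown estimate for a barrier-synchronized application.

    Assumption: each iteration ends when the slowest node finishes, so a
    noise event on any single node delays every node.
    """
    period_s = 1.0 / freq_hz
    if compute_s >= period_s:
        # Every node is interrupted about compute_s / period_s times,
        # so all nodes are delayed by roughly the same amount.
        delay_s = duration_s * compute_s / period_s
    else:
        # A given node is hit with probability compute_s / period_s;
        # the iteration is delayed if at least one node is hit.
        p_hit = compute_s / period_s
        p_any = 1.0 - (1.0 - p_hit) ** nodes
        delay_s = duration_s * p_any
    return (compute_s + delay_s) / compute_s


if __name__ == "__main__":
    # Two hypothetical patterns with the same 2.5% net noise.
    patterns = [(10.0, 2.5e-3), (1000.0, 25e-6)]
    for freq_hz, duration_s in patterns:
        for nodes in (1, 10_000):
            s = estimated_slowdown(freq_hz, duration_s, nodes, compute_s=1e-3)
            print(f"{freq_hz:7.0f} Hz x {duration_s * 1e6:6.0f} us, "
                  f"{nodes:6d} nodes: slowdown ~ {s:.3f}")
```

Under these assumptions, both patterns cost about 2.5% on a single node, but at 10,000 nodes the low-frequency, long-duration pattern is amplified because some node is almost always interrupted during each compute interval, while the high-frequency, short-duration pattern stays near its net cost.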