In this paper we describe how we improved the effective performance of ASCI Q, the world's second-fastest supercomputer, to meet our expectations. Using an arsenal of performance-analysis techniques including analytical models, custom microbenchmarks, full applications, and simulators, we succeeded in observing a serious, but previously undetected, performance problem. We identified the source of the problem, eliminated the problem, and "closed the loop" by demonstrating up to a factor of 2 improvement in application performance. We present our methodology and provide insight into performance analysis that is immediately applicable to other large-scale supercomputers.