The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q

Authors:
Fabrizio Petrini; Darren J. Kerbyson; Scott Pakin
Affiliations:
Los Alamos National Laboratory, New Mexico; Los Alamos National Laboratory, New Mexico; Los Alamos National Laboratory, New Mexico
Venue:
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Year:
2003


Abstract

In this paper we describe how we improved the effective performance of ASCI Q, the world's second-fastest supercomputer, to meet our expectations. Using an arsenal of performance-analysis techniques, including analytical models, custom microbenchmarks, full applications, and simulators, we succeeded in observing a serious, but previously undetected, performance problem. We identified the source of the problem, eliminated the problem, and "closed the loop" by demonstrating up to a factor of 2 improvement in application performance. We present our methodology and provide insight into performance analysis that is immediately applicable to other large-scale supercomputers.