Scalable load-balance measurement for SPMD codes

Authors:
Todd Gamblin;Bronis R. de Supinski;Martin Schulz;Rob Fowler;Daniel A. Reed
Affiliations:
University of North Carolina at Chapel Hill;Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory;University of North Carolina at Chapel Hill;Microsoft Research
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008



Abstract

Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally ordered measurements from full-scale runs. This potentially requires collecting data from multiple code regions on all processors over the entire execution. Naive instrumentation of this kind can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can severely perturb application behavior. We present and evaluate a novel technique for scalable, low-error load-balance measurement. The technique uses a parallel wavelet transform and other parallel encoding methods. We show that it collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size, and the compressed data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.
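To illustrate why a wavelet transform suits this kind of data, the following is a minimal serial sketch, not the parallel transform or the encoding methods the abstract refers to. It assumes a simple unnormalized Haar wavelet, a power-of-two number of ranks, and a hypothetical vector of per-rank times for one code region. The function names (`haar_forward`, `haar_inverse`, `keep_largest`) and the sample data are illustrative assumptions, not taken from the paper.

```python
def haar_forward(values):
    """Multi-level Haar transform (pairwise averages and differences).

    Assumes len(values) is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2.0 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2.0 for i in range(half)]
        # Coarse averages go in front, detail coefficients behind them.
        out[:length] = avg + diff
        length = half
    return out


def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding the signal level by level."""
    n = len(coeffs)
    out = list(coeffs)
    length = 1
    while length < n:
        avg = out[:length]
        diff = out[length:2 * length]
        merged = []
        for a, d in zip(avg, diff):
            merged.extend((a + d, a - d))
        out[:2 * length] = merged
        length *= 2
    return out


def keep_largest(coeffs, keep):
    """Lossy step: zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


# Hypothetical per-rank times (seconds) for one region: two ranks are overloaded.
loads = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 18.0, 18.0]

coeffs = haar_forward(loads)
sparse = keep_largest(coeffs, keep=3)
restored = haar_inverse(sparse)
max_error = max(abs(a - b) for a, b in zip(loads, restored))
print("nonzero coefficients kept:", sum(1 for c in sparse if c != 0.0))
print("max reconstruction error:", max_error)
```

In this toy case the load is piecewise constant across ranks, so most detail coefficients are zero and a few coefficients describe the whole vector. Real measurements would be noisier, and the number of coefficients kept would trade data volume against reconstruction error.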