Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Naive instrumentation of this kind can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can severely perturb application behavior. We present and evaluate a novel technique for scalable, low-error load balance measurement. The technique uses a parallel wavelet transform and other parallel encoding methods. We show that it collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size, and the compressed data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.
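To make the idea of wavelet-based compression of load measurements concrete, the following is a minimal sequential sketch, not the parallel method the abstract describes. It assumes an orthonormal Haar wavelet (the abstract does not name a wavelet family), a power-of-two number of samples, and a simple "keep the largest coefficients" rule standing in for the unspecified encoding methods. All function names (`haar_forward`, `haar_inverse`, `compress`, `reconstruct`) and the sample load values are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_forward(values):
    """Multi-level orthonormal Haar transform (assumed wavelet, for illustration)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        tmp = out[:length]
        for i in range(half):
            a, b = tmp[2 * i], tmp[2 * i + 1]
            out[i] = (a + b) * _S         # approximation (scaled average)
            out[half + i] = (a - b) * _S  # detail (scaled difference)
        length = half
    return out


def haar_inverse(coeffs):
    """Invert haar_forward."""
    n = len(coeffs)
    out = list(coeffs)
    length = 2
    while length <= n:
        half = length // 2
        tmp = out[:length]
        for i in range(half):
            avg, diff = tmp[i], tmp[half + i]
            out[2 * i] = (avg + diff) * _S
            out[2 * i + 1] = (avg - diff) * _S
        length *= 2
    return out


def compress(values, keep):
    """Keep only the `keep` largest-magnitude coefficients (hypothetical encoding rule)."""
    coeffs = haar_forward(values)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return len(coeffs), {i: coeffs[i] for i in order[:keep]}


def reconstruct(n, sparse):
    """Rebuild an approximation from the sparse coefficient set."""
    coeffs = [0.0] * n
    for i, c in sparse.items():
        coeffs[i] = c
    return haar_inverse(coeffs)


# Hypothetical per-process load samples: two neighbouring ranks are overloaded.
load = [10, 10, 10, 10, 30, 30, 10, 10]
n, sparse = compress(load, keep=3)
approx = reconstruct(n, sparse)
max_error = max(abs(a - b) for a, b in zip(load, approx))
```

In this toy input the load is piecewise constant, so only a few coefficients are nonzero and a small retained set is enough to rebuild the measurements closely. Real measurements would be noisier, and the trade-off between retained coefficients and reconstruction error would have to be evaluated empirically.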