Performance debugging shared memory parallel programs using run-time dependence analysis

Authors:
Ramakrishnan Rajamony;Alan L. Cox
Affiliations:
Departments of Electrical & Computer Engineering, Rice University, Houston, TX;Departments of Electrical & Computer Science, Rice University, Houston, TX
Venue:
SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Year:
1997

Abstract

We describe a new approach to performance debugging that focuses on automatically identifying computation transformations to reduce synchronization and communication. By grouping writes together into equivalence classes, we are able to tractably collect information from long-running programs. Our performance debugger analyzes this information and suggests computation transformations in terms of the source code. We present the transformations suggested by the debugger on a suite of four applications. For Barnes-Hut and Shallow, implementing the debugger's suggestions improved performance by factors of 1.32 and 34, respectively, on an 8-processor IBM SP2. For Ocean, our debugger identified excess synchronization that did not have a significant impact on performance. ILINK, a genetic linkage analysis program widely used by geneticists, is already well optimized; we use it only to demonstrate the feasibility of our approach for long-running applications.

We also give details on how our approach can be implemented. We use novel techniques to convert control dependences to data dependences and to compute the source operands of stores. We report on the impact of our instrumentation on the same application suite we use for performance debugging. The instrumentation slows down execution by a factor of between 4 and 169. The log files produced during execution were all less than 2.5 Mbytes in size.