Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Abstract execution: a technique for efficiently tracing programs
Software—Practice & Experience
Quartz: a tool for tuning parallel program performance
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Run-Time Parallelization and Scheduling of Loops
IEEE Transactions on Computers
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Efficiently computing static single assignment form and the control dependence graph
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiling Fortran D for MIMD distributed-memory machines
Communications of the ACM
Improving the performance of runtime parallelization
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Integrating parallelization strategies for linkage analysis
Computers and Biomedical Research
EEL: machine-independent executable editing
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Compiler optimizations for eliminating barrier synchronization
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Run-time methods for parallelizing partially parallel loops
ICS '95 Proceedings of the 9th international conference on Supercomputing
Online data-race detection via coherency guarantees
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Dependence Analysis for Supercomputing
Dependence Analysis for Supercomputing
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers
A Unified Formalization of Four Shared-Memory Models
IEEE Transactions on Parallel and Distributed Systems
A Performance Debugger for Eliminating Excess Synchronization in Shared-Memory Parallel Programs
MASCOTS '96 Proceedings of the 4th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Performance measurements for multithreaded programs
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Using cause-effect analysis to understand the performance of distributed programs
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Tapeworm: high-level abstractions of shared accesses
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
A high-level abstraction of shared accesses
ACM Transactions on Computer Systems (TOCS)
Proceedings of the 14th international conference on Supercomputing
Non-Intrusive Detection of Synchronization Errors Using Execution Replay
Automated Software Engineering
Performance Tuning Software DSM Applications using Visualisation
The Journal of Supercomputing
Hi-index | 0.00 |
We describe a new approach to performance debugging that focuses on automatically identifying computation transformations to reduce synchronization and communication. By grouping writes together into equivalence classes, we are able to tractably collect information from long-running programs. Our performance debugger analyzes this information and suggests computation transformations in terms of the source code. We present the transformations suggested by the debugger on a suite of four applications. For Barnes-Hut and Shallow, implementing the debugger suggestions improved the performance by a factor of 1.32 and 34 times respectively on an 8-processor IBM SP2. For Ocean, our debugger identified excess synchronization that did not have a significant impact on performance. ILINK, a genetic linkage analysis program widely used by geneticists, is already well optimized. We use it only to demonstrate the feasibility of our approach to long-running applications.We also give details on how our approach can be implemented. We use novel techniques to convert control dependences to data dependences, and to compute the source operands of stores. We report on the impact of our instrumentation on the same application suite we use for performance debugging. The instrumentation slows down the execution by a factor of between 4 and 169 times. The log files produced during execution were all less than 2.5 Mbytes in size.