Exploiting hardware performance counters with flow and context sensitive profiling
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A Dynamic Tracing Mechanism for Performance Analysis of OpenMP Applications
WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Construction and Compression of Complete Call Graphs for Post-Mortem Program Trace Analysis
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Toward Scalable Performance Visualization with Jumpshot
International Journal of High Performance Computing Applications
Low-overhead call path profiling of unmodified, optimized code
Proceedings of the 19th annual international conference on Supercomputing
The Tau Parallel Performance System
International Journal of High Performance Computing Applications
MPI performance analysis tools on Blue Gene/L
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
An efficient format for nearly constant-time access to arbitrary time intervals in large trace files
Scientific Programming - Large-Scale Programming Tools and Environments
Open | SpeedShop: An open source infrastructure for parallel performance analysis
Scientific Programming - Large-Scale Programming Tools and Environments
Scalable load-balance measurement for SPMD codes
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Effective performance measurement and analysis of multithreaded applications
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Binary analysis for measurement and attribution of program performance
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
ScalaTrace: Scalable compression and replay of communication traces for high-performance computing
Journal of Parallel and Distributed Computing
Automatic detection of parallel applications computation phases
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Diagnosing performance bottlenecks in emerging petascale applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Analyzing lock contention in multithreaded applications
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org
Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
The Scalasca performance toolset architecture
Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
A new vision for coarray Fortran
Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
Clustering performance data efficiently at massive scales
Proceedings of the 24th ACM International Conference on Supercomputing
Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Detailed performance analysis using coarse grain sampling
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Automatic structure extraction from MPI applications tracefiles
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Quantifying the effectiveness of load balance algorithms
Proceedings of the 26th ACM international conference on Supercomputing
Novel views of performance data to analyze large-scale adaptive applications
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Effective sampling-driven performance tools for GPU-accelerated supercomputers
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Optimizing I/O forwarding techniques for extreme-scale event tracing
Cluster Computing
Hi-index | 0.00 |
Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks. Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained context-sensitive data --- the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion; and no prior tool scalably presents both context- and time-sensitive data. This paper describes how to collect, analyze and present fine-grained call path traces for parallel programs. To scale our measurements, we use asynchronous sampling, whose granularity is controlled by a sampling frequency, and a compact representation. To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.