We have developed an environment, built on robust, existing open source software, for tuning applications written using MPI, OpenMP, or both. This effort integrates the OpenUH compiler with several popular performance tools; its goal is to increase user productivity by providing an automated, scalable performance measurement and optimization system. In this paper we describe the environment, show how these complementary tools work together, and illustrate the synergies that come from exploiting their individual strengths and combined interactions. We also present a methodology for performance tuning that this environment enables. One benefit of using compiler technology in this context is that the compiler can direct the performance measurements to capture events at different levels of granularity and help assess their importance, which we have shown to significantly reduce measurement overheads. The compiler can also help in understanding the performance results: it can supply information on how a code was translated and whether optimizations were applied. Our methodology combines two performance views of the application to find bottlenecks. The first is a high-level view that focuses on OpenMP/MPI performance problems such as synchronization cost and load imbalance; the second is a low-level view that focuses on hardware counter analysis, with derived metrics that assess the efficiency of the code. Our experiments have shown that this approach can significantly reduce overheads for both profiling and tracing to acceptable levels and limit the number of times the application must be run with selected hardware counters. We demonstrate the methodology on selected NAS Parallel Benchmarks and a cloud-resolving code.
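To make the two performance views concrete, the following is a minimal sketch in Python of the kind of quantity each view might report. It is an illustration only, not the tooling described here: the function names, the counter names, and the sample numbers are all hypothetical, and the metric definitions (load imbalance as the gap between the slowest thread and the mean, instructions per cycle, cache misses per thousand instructions) are common conventions assumed for the example rather than definitions taken from the paper.

```python
from statistics import mean


def load_imbalance(times):
    """High-level view: fraction of the slowest thread's time that the
    average thread spends waiting. 0.0 means perfectly balanced.
    (Assumed definition: (max - mean) / max.)"""
    longest = max(times)
    if longest == 0:
        return 0.0
    return (longest - mean(times)) / longest


def derived_metrics(counters):
    """Low-level view: combine raw hardware counter values into
    efficiency metrics. Counter names here are hypothetical."""
    cycles = counters["cycles"]
    instructions = counters["instructions"]
    return {
        # Instructions completed per cycle.
        "ipc": instructions / cycles,
        # L2 cache misses per thousand instructions.
        "l2_misses_per_kilo_instr": 1000.0 * counters["l2_misses"] / instructions,
    }


# Hypothetical per-thread times (seconds) for one parallel region.
region_times = [4.0, 4.0, 4.0, 8.0]
imbalance = load_imbalance(region_times)

# Hypothetical raw counter values for the same region.
metrics = derived_metrics(
    {"cycles": 2_000_000, "instructions": 1_000_000, "l2_misses": 5_000}
)

print(f"load imbalance: {imbalance:.3f}")
print(f"IPC: {metrics['ipc']:.2f}")
print(f"L2 misses per 1000 instructions: {metrics['l2_misses_per_kilo_instr']:.1f}")
```

In a workflow of this shape, a region with high imbalance would be examined in the high-level view first, and the derived counter metrics would then be collected only for the regions that remain of interest, which is one way to limit the number of runs with hardware counters enabled.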