A performance tuning methodology with compiler support

Authors:
Oscar Hernandez;Barbara Chapman;Haoqiang Jin
Affiliations:
Computer Science Department, University of Houston, Houston, TX, USA. E-mails: {oscar, chapman}@cs.uh.edu;Computer Science Department, University of Houston, Houston, TX, USA. E-mails: {oscar, chapman}@cs.uh.edu;NASA Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA, USA. E-mail: hjin@nas.nasa.gov
Venue:
Scientific Programming - Large-Scale Programming Tools and Environments
Year:
2008

Abstract

We have developed an environment, based on robust, existing open source software, for tuning applications written using MPI, OpenMP, or both. This effort integrates the OpenUH compiler with several popular performance tools; its goal is to increase user productivity by providing an automated, scalable performance measurement and optimization system. In this paper we describe the environment, show how these complementary tools can work together, and illustrate the synergies that come from exploiting their individual strengths and their combined interactions. We also present a performance tuning methodology that this environment enables. One benefit of using compiler technology in this context is that the compiler can direct the performance measurements to capture events at different levels of granularity and help assess their importance, which we have shown significantly reduces measurement overhead. The compiler can also help in understanding the performance results: it can supply information on how a code was translated and whether optimizations were applied. Our methodology combines two performance views of the application to find bottlenecks. The first is a high-level view that focuses on OpenMP/MPI performance problems such as synchronization cost and load imbalance; the second is a low-level view that focuses on hardware counter analysis, with derived metrics that assess the efficiency of the code. Our experiments have shown that this approach can reduce the overheads of both profiling and tracing to acceptable levels and limit the number of times the application needs to be run with selected hardware counters. We demonstrate the methodology on selected NAS Parallel Benchmarks and a cloud-resolving code.
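The two performance views described in the abstract can be illustrated with a small sketch. This is not the tooling or the metric set used in the paper; it only shows, under stated assumptions, what a high-level load-imbalance figure and a pair of low-level derived metrics might look like once raw measurements are in hand. The function names, the counter keys (`cycles`, `instructions`, `l2_accesses`, `l2_misses`), and the imbalance formula are all hypothetical choices made for illustration.

```python
from statistics import mean


def load_imbalance(times):
    """High-level view (assumed formula): (max - mean) / max of per-thread times.

    Returns 0.0 when all threads take the same time; values near 1.0 mean
    one thread dominates while the others mostly wait.
    """
    longest = max(times)
    if longest == 0:
        return 0.0
    return (longest - mean(times)) / longest


def derived_metrics(counters):
    """Low-level view: derive efficiency ratios from raw counter totals.

    The dictionary keys are hypothetical names, not those of any real
    hardware counter library.
    """
    cycles = counters["cycles"]
    accesses = counters["l2_accesses"]
    return {
        # Instructions retired per cycle
        "ipc": counters["instructions"] / cycles if cycles else 0.0,
        # Fraction of L2 accesses that missed
        "l2_miss_rate": counters["l2_misses"] / accesses if accesses else 0.0,
    }


if __name__ == "__main__":
    # Made-up per-thread times (seconds) for one parallel region
    print(load_imbalance([4.0, 4.0, 4.0, 8.0]))
    # Made-up counter totals for the same region
    print(derived_metrics({"cycles": 2000, "instructions": 1000,
                           "l2_accesses": 400, "l2_misses": 100}))
```

In a workflow like the one the abstract outlines, a region with a high imbalance value would be a candidate for the high-level analysis, while a region with low instructions per cycle or a high miss rate would be a candidate for closer hardware counter study.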