A scalable cross-platform infrastructure for application performance tuning using hardware counters

Authors:
S. Browne;J. Dongarra;N. Garner;K. London;P. Mucci
Affiliations:
Computer Science Dept., University of Tennessee, Knoxville;Computer Science Dept., University of Tennessee, Knoxville and Oak Ridge National Laboratory;Computer Science Dept., University of Tennessee, Knoxville;Computer Science Dept., University of Tennessee, Knoxville;Computer Science Dept., University of Tennessee, Knoxville
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000

Citing 2
Cited 69

Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
SvPablo: A Multi-Language Architecture-Independent Performance Analysis System

ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing

Algorithmic modifications to the Jacobi-Davidson parallel eigensolver to dynamically balance external CPU and memory load

ICS '01 Proceedings of the 15th international conference on Supercomputing
On using SCALEA for performance analysis of distributed and parallel programs

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Modeling and detecting performance problems for distributed and parallel programs with JavaPSL

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Performance Contracts: Predicting and Monitoring Grid Application Behavior

GRID '01 Proceedings of the Second International Workshop on Grid Computing
A Comparison of Counting and Sampling Modes of Using Performance Monitoring Hardware

ICCS '02 Proceedings of the International Conference on Computational Science-Part II
CATCH - A Call-Graph Based Automatic Tool for Capture of Hardware Performance Metrics for MPI and OpenMP Applications

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
SCALEA: A Performance Analysis Tool for Distributed and Parallel Programs

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Review of Performance Analysis Tools for MPI Parallel Programs

Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Performance Analysis for MPI Applications with SCALEA

Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
SIGMA: a simulator infrastructure to guide memory analysis

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
An empirical performance evaluation of scalable scientific applications

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Scalable analysis techniques for microprocessor performance counter metrics

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Asserting performance expectations

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance optimizations and bounds for sparse matrix-vector multiply

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Parallel program performance prediction using deterministic task graph analysis

ACM Transactions on Computer Systems (TOCS)
Detailed cache coherence characterization for OpenMP benchmarks

Proceedings of the 18th annual international conference on Supercomputing
Vertical profiling: understanding the behavior of object-oriented applications

OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Memory Profiling using Hardware Counters

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
EMPS: An Environment for Memory Performance Studies

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion

ACM Transactions on Mathematical Software (TOMS)
Fast data-locality profiling of native execution

SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Diagnosing performance overheads in the xen virtual machine environment

Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments
Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
How Well Can Simple Metrics Represent the Performance of HPC Applications?

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Using Dynamic Tracing Sampling to Measure Long Running Programs

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Reliability challenges in large systems

Future Generation Computer Systems
Evaluating fragment construction policies for SDT systems

Proceedings of the 2nd international conference on Virtual execution environments
MPI performance analysis tools on Blue Gene/L

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
An intra-task dvfs technique based on statistical analysis of hardware events

Proceedings of the 4th international conference on Computing frontiers
Performance metrics and ontologies for Grid workflows

Future Generation Computer Systems
Dynamic compilation: the benefits of early investing

Proceedings of the 3rd international conference on Virtual execution environments
Data layouts for object-oriented programs

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks

IEEE Transactions on Parallel and Distributed Systems
Using hardware performance monitors to understand the behavior of java applications

VM'04 Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3
Operating system profiling via latency analysis

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications

Simulation
CAMP: a common API for measuring performance

LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
Dynamic tiling for effective use of shared caches on multithreaded processors

International Journal of High Performance Computing and Networking
Processor hardware counter statistics as a first-class system resource

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
A regression-based approach to scalability prediction

Proceedings of the 22nd annual international conference on Supercomputing
Feedback-controlled resource sharing for predictable eScience

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Online Phase-Adaptive Data Layout Selection

ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Prediction models for multi-dimensional power-performance optimization on many cores

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Scalable Implementation of Efficient Locality Approximation

Languages and Compilers for Parallel Computing
Producing wrong data without doing anything obviously wrong!

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Blind Optimization for Exploiting Hardware Features

CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Algorithm, software, and hardware optimizations for Delaunay mesh generation on simultaneous multithreaded architectures

Journal of Parallel and Distributed Computing
A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
A concurrent dynamic analysis framework for multicore hardware

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Memory hierarchy optimizations and performance bounds for sparse ATAx

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
HieraAnalyses – a tool for hierarchical analysis of parallel programs

International Journal of High Performance Systems Architecture
Efficient hardware-based nonintrusive dynamic application profiling

ACM Transactions on Embedded Computing Systems (TECS)
Should we worry about memory loss?

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
autopin: automated optimization of thread-to-core pinning on multicore systems

Transactions on high-performance embedded architectures and compilers III
Performance modeling for systematic performance tuning

State of the Practice Reports
Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Optimizing interpreters by tuning opcode orderings on virtual machines for modern architectures: or: how I learned to stop worrying and love hill climbing

Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
PerfMiner: cluster-wide collection, storage and presentation of application level hardware performance data

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Automatic tuning of PDGEMM towards optimal performance

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A tool to display array access patterns in OpenMP programs

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
ADP: automated diagnosis of performance pathologies using hardware events

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Automatic restructuring of GPU kernels for exploiting inter-thread data locality

CC'12 Proceedings of the 21st international conference on Compiler Construction
THeME: a system for testing by hardware monitoring events

Proceedings of the 2012 International Symposium on Software Testing and Analysis
Vectorization technology to improve interpreter performance

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Refactoring and automated performance tuning of computational chemistry application codes

Proceedings of the Winter Simulation Conference
ACIC: automatic cloud I/O configurator for HPC applications

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Abstract

The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count “events”, which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has proposed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been or is in the process of being implemented on all major HPC platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools.
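
The idea of a standard event set layered over platform-specific counter hardware can be illustrated with a small sketch. The Python below is not the PAPI API and does not touch real hardware. The event names, the platform names, the native codes, and the `CounterInterface` class are all hypothetical, chosen only to show how one portable interface might translate standard events into native ones while respecting a small number of counter registers.

```python
# Hypothetical portable event names; NOT the actual PAPI event list.
STANDARD_EVENTS = ("TOTAL_CYCLES", "TOTAL_INSTRUCTIONS", "L1_DATA_MISSES")

# Made-up per-platform tables mapping standard events to native codes.
# Real processors use vendor-specific encodings not reproduced here.
NATIVE_EVENT_TABLES = {
    "platform_a": {
        "TOTAL_CYCLES": 0x01,
        "TOTAL_INSTRUCTIONS": 0x02,
        "L1_DATA_MISSES": 0x10,
    },
    # This hypothetical platform exposes no L1 data miss event.
    "platform_b": {
        "TOTAL_CYCLES": 0xA0,
        "TOTAL_INSTRUCTIONS": 0xA1,
    },
}


class EventNotAvailable(Exception):
    """Raised when a standard event has no native equivalent."""


class CounterInterface:
    """Portable front end over one platform's native event table."""

    def __init__(self, platform, num_registers=2):
        self._table = NATIVE_EVENT_TABLES[platform]
        self._num_registers = num_registers  # counters are a scarce resource
        self._selected = []

    def add_event(self, name):
        """Select a standard event, if the platform and registers allow it."""
        if name not in STANDARD_EVENTS:
            raise ValueError(f"unknown standard event: {name}")
        if name not in self._table:
            raise EventNotAvailable(name)
        if len(self._selected) >= self._num_registers:
            raise RuntimeError("no free counter register")
        self._selected.append(name)

    def native_codes(self):
        """Return the native codes that would be programmed into hardware."""
        return [self._table[name] for name in self._selected]


def select_basic_events(platform):
    """Caller code is identical on every platform."""
    counters = CounterInterface(platform)
    counters.add_event("TOTAL_CYCLES")
    counters.add_event("TOTAL_INSTRUCTIONS")
    return counters.native_codes()
```

In this sketch, `select_basic_events` runs unchanged on both hypothetical platforms and only the native codes differ, which is the portability property the abstract describes. A request for an event the platform lacks, or for more events than there are registers, fails explicitly rather than silently.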