A Portable Programming Interface for Performance Evaluation on Modern Processors

Authors:
S. Browne; J. Dongarra; N. Garner; G. Ho; P. Mucci
Affiliations:
Computer Science Department, University of Tennessee, Knoxville, Tennessee, U.S.A.; Computer Science Department, University of Tennessee, Knoxville, and Oak Ridge National Laboratory, Tennessee, U.S.A.; Computer Science Department, University of Tennessee, Knoxville, Tennessee, U.S.A.; Computer Science Department, University of Tennessee, Knoxville, Tennessee, U.S.A.; Computer Science Department, University of Tennessee, Knoxville, Tennessee, U.S.A.
Venue:
International Journal of High Performance Computing Applications
Year:
2000

Abstract

The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis, including hand tuning, compiler optimization, debugging, benchmarking, monitoring, and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development toward alleviating commonly occurring bottlenecks in high performance computing.
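
The counting model described in the abstract can be illustrated with a small simulation. The sketch below is not the PAPI interface and does not read real hardware registers: `CounterSet`, `summed_stride`, `miss_rate`, the event names `LOADS` and `CACHE_MISSES`, the four-register limit, and the one-line cache model are all hypothetical, chosen only to show how event counts gathered around a region of code can be related back to the structure of that code.

```python
"""Toy model of hardware event counting. Hypothetical; not the PAPI API."""
from dataclasses import dataclass, field


@dataclass
class CounterSet:
    """A small bank of counters, each bound to one named event."""
    events: tuple
    max_counters: int = 4  # assumed register limit, for illustration only
    running: bool = False
    counts: dict = field(default_factory=dict)

    def __post_init__(self):
        # Real processors expose only a few counter registers at once.
        if len(self.events) > self.max_counters:
            raise ValueError("more events than counter registers")
        self.counts = {name: 0 for name in self.events}

    def start(self):
        """Zero the counters and begin counting."""
        self.counts = {name: 0 for name in self.events}
        self.running = True

    def signal(self, event, n=1):
        """Simulate the processor raising `event` n times."""
        if self.running and event in self.counts:
            self.counts[event] += n

    def stop(self):
        """Stop counting and return a snapshot of the counts."""
        self.running = False
        return dict(self.counts)


def summed_stride(data, stride, counters, line_size=8):
    """Sum every `stride`-th element while reporting simulated events.

    Assumes a cache holding a single line of `line_size` elements: a load
    misses whenever it touches a different line from the previous load.
    """
    total = 0
    last_line = None
    for i in range(0, len(data), stride):
        line = i // line_size
        counters.signal("LOADS")
        if line != last_line:
            counters.signal("CACHE_MISSES")
            last_line = line
        total += data[i]
    return total


def miss_rate(counts):
    """Derived metric: simulated cache misses per load."""
    return counts["CACHE_MISSES"] / counts["LOADS"]


if __name__ == "__main__":
    data = list(range(64))
    counters = CounterSet(events=("LOADS", "CACHE_MISSES"))
    for stride in (1, 8):
        counters.start()
        summed_stride(data, stride, counters)
        print(stride, miss_rate(counters.stop()))
```

Under these assumptions the unit-stride traversal misses on one load in eight, while the stride-8 traversal misses on every load. This is the kind of relationship between code structure and architectural efficiency that counter data is meant to expose.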