The characterization of two scientific workloads using the CRAY X-MP performance monitor
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Performance debugging shared memory multiprocessor programs with MTOOL
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Characterizing the caching and synchronization performance of a multiprocessor operating system
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Architectural support for performance tuning: a case study on the SPARCcenter 2000
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
EEL: machine-independent executable editing
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Workload characterization using the Cray hardware performance monitor
The Journal of Supercomputing
Informing memory operations: providing memory performance feedback in modern processors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Embra: fast and flexible machine simulation
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
The SHRIMP performance monitor: design and applications
SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors
MASCOTS '94 Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation On Computer and Telecommunication Systems
Advanced performance features of the 64-bit PA-8000
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
PROTEUS: A HIGH-PERFORMANCE PARALLEL-ARCHITECTURE SIMULATOR
PROTEUS: A HIGH-PERFORMANCE PARALLEL-ARCHITECTURE SIMULATOR
Data distribution support on distributed shared memory multiprocessors
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Improving parallel shear-warp volume rendering on shared address space multiprocessors
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
Continuous profiling: where have all the cycles gone?
Proceedings of the sixteenth ACM symposium on Operating systems principles
Towards efficient parallel radiosity for DSM-based parallel computers using virtual interfaces
PRS '97 Proceedings of the IEEE symposium on Parallel rendering
A methodology and an evaluation of the SGI Origin2000
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Portable profiling and tracing for parallel, scientific applications using C++
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Proceedings of the 1st international workshop on Software and performance
Critical Path Profiling of Message Passing and Shared-Memory Programs
IEEE Transactions on Parallel and Distributed Systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Scal-Tool: pinpointing and quantifying scalability bottlenecks in DSM multiprocessors
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
HLS: combining statistical and symbolic simulation to guide microprocessor designs
Proceedings of the 27th annual international symposium on Computer architecture
Using hardware performance monitors to isolate memory bottlenecks
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Analytical cache models with applications to cache partitioning
ICS '01 Proceedings of the 15th international conference on Supercomputing
Tools for application-oriented performance tuning
ICS '01 Proceedings of the 15th international conference on Supercomputing
Cache decay: exploiting generational behavior to reduce cache leakage power
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Compiler-Directed Collective-I/O
IEEE Transactions on Parallel and Distributed Systems
Data Relation Vectors: A New Abstraction for Data Optimizations
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Let caches decay: reducing leakage energy via exploitation of cache generational behavior
ACM Transactions on Computer Systems (TOCS)
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A fast on-chip profiler memory
Proceedings of the 39th annual Design Automation Conference
Computation regrouping: restructuring programs for temporal data cache locality
ICS '02 Proceedings of the 16th international conference on Supercomputing
Hardware-Assisted Characterization of NAS Benchmarks
Cluster Computing
HPCVIEW: A Tool for Top-down Analysis of Node Performance
The Journal of Supercomputing
Optimizing Main-Memory Join on Modern Hardware
IEEE Transactions on Knowledge and Data Engineering
Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance
IEEE Transactions on Computers
Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
MPX: Software for Multiplexing Hardware Performance Counters in Multithreaded Programs
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
A Statistical-Empirical Hybrid Approach to Hierarchical Memory Analysis
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Set Associative Cache Behavior Optimization
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Performance Issues in Parallel Processing Systems
Performance Evaluation: Origins and Directions
SAFECOMP '01 Proceedings of the 20th International Conference on Computer Safety, Reliability and Security
SvPablo: A Multi-language Performance Analysis System
TOOLS '98 Proceedings of the 10th International Conference on Computer Performance Evaluation: Modelling Techniques and Tools
SIGMA: a simulator infrastructure to guide memory analysis
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Scalable analysis techniques for microprocessor performance counter metrics
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
C++ Expression Templates Performance Issues in Scientific Computing
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Sourcebook of parallel computing
Frequent loop detection using efficient non-intrusive on-chip hardware
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Using Interaction Costs for Microarchitectural Bottleneck Analysis
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A compiler tool to predict memory hierarchy performance of scientific codes
Parallel Computing
Restructuring computations for temporal data cache locality
International Journal of Parallel Programming
Runtime Code Parallelization for On-Chip Multiprocessors
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Interaction cost and shotgun profiling
ACM Transactions on Architecture and Code Optimization (TACO)
Cluster miss prediction with prefetch on miss for embedded CPU instruction caches
Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
Vertical profiling: understanding the behavior of object-priented applications
OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Data Centric Cache Measurement on the Intel ltanium 2 Processor
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Memory Profiling using Hardware Counters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Snug set-associative caches: reducing leakage power while improving performance
ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Cache-Efficient Multigrid Algorithms
International Journal of High Performance Computing Applications
Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware
IEEE Transactions on Computers
International Journal of Parallel Programming
Proceedings of the 41st annual Design Automation Conference
Performance and environment monitoring for continuous program optimization
IBM Journal of Research and Development
Analytical modeling of codes with arbitrary data-dependent conditional structures
Journal of Systems Architecture: the EUROMICRO Journal
Memory and Network Bandwidth Aware Scheduling of Multiprogrammed Workloads on Clusters of SMPs
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
A performance counter architecture for computing accurate CPI components
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
ACM Transactions on Architecture and Code Optimization (TACO)
Using hardware performance monitors to understand the behavior of java applications
VM'04 Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3
General purpose operating system support for multiple page sizes
ATEC '98 Proceedings of the annual conference on USENIX Annual Technical Conference
Sampling-based program locality approximation
Proceedings of the 7th international symposium on Memory management
Hardware monitors for dynamic page migration
Journal of Parallel and Distributed Computing
Non-intrusive dynamic application profiler for detailed loop execution characterization
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Journal of Parallel and Distributed Computing
Microprocessors & Microsystems
Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
Efficient hardware-based nonintrusive dynamic application profiling
ACM Transactions on Embedded Computing Systems (TECS)
Performance profiling of virtual machines
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
International Journal of Reconfigurable Computing - Special issue on selected papers from the 17th reconfigurable architectures workshop (RAW2010)
Rapid identification of architectural bottlenecks via precise event counting
Proceedings of the 38th annual international symposium on Computer architecture
A case for unlimited watchpoints
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
An efficient CPI stack counter architecture for superscalar processors
Proceedings of the great lakes symposium on VLSI
Hi-index | 0.01 |
Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events -- support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications that illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications.