Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.
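The idea of sampling instructions rather than counting events can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the hardware design described in the paper: the record fields, the function names `sample_instructions` and `estimate_per_pc`, and the uniform randomized sampling interval are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SampleRecord:
    """Hypothetical per-instruction record; the field names are illustrative."""
    pc: int               # address of the sampled instruction
    dcache_miss: bool     # did this instruction miss in the data cache?
    mispredicted: bool    # was this instruction a mispredicted branch?
    latency_cycles: int   # cycles the instruction spent in the pipeline


def sample_instructions(trace, mean_interval, rng):
    """Select instructions from `trace` at randomized intervals."""
    samples = []
    # Assumption: a uniform interval in [1, 2*mean_interval - 1], whose mean
    # is mean_interval, so sampling does not lock onto loop periods.
    countdown = rng.randint(1, 2 * mean_interval - 1)
    for record in trace:
        countdown -= 1
        if countdown == 0:
            samples.append(record)
            countdown = rng.randint(1, 2 * mean_interval - 1)
    return samples


def estimate_per_pc(samples, mean_interval):
    """Scale each sample by the mean interval to estimate per-PC totals."""
    est = defaultdict(
        lambda: {"executions": 0, "dcache_misses": 0, "latency_sum": 0}
    )
    for s in samples:
        entry = est[s.pc]
        entry["executions"] += mean_interval
        entry["dcache_misses"] += mean_interval * s.dcache_miss
        entry["latency_sum"] += mean_interval * s.latency_cycles
    return dict(est)


if __name__ == "__main__":
    # Made-up trace: a load at 0x10 that misses one time in four.
    trace = [SampleRecord(0x10, i % 4 == 0, False, 3) for i in range(10_000)]
    samples = sample_instructions(trace, 100, random.Random(1))
    print(estimate_per_pc(samples, 100))
```

Because every sample carries its own program counter together with its events and latency, each event is attributed to the instruction that caused it. With a plain event counter, the overflow interrupt may be delivered several instructions later, which is the attribution problem the abstract describes.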