Rapid identification of architectural bottlenecks via precise event counting

Authors:
John Demme; Simha Sethumadhavan
Affiliations:
Columbia University, NY, NY, USA;Columbia University, NY, NY, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011

Abstract

On-chip performance counters play a vital role in computer architecture research because they quickly provide insights into application behaviors that are time-consuming to characterize with traditional methods. The usefulness of modern performance counters, however, is limited by the inefficient techniques used today to access them. Current access techniques rely on imprecise sampling or heavyweight kernel interaction, forcing users to choose between precision and speed and thus restricting the use of performance counter hardware. In this paper, we describe new methods that enable precise, lightweight interfacing to on-chip performance counters. These low-overhead techniques allow precise reading of virtualized counters in the low tens of nanoseconds, which is one to two orders of magnitude faster than current access techniques. Further, these tools provide several fresh insights into the behavior of modern parallel programs such as MySQL and Firefox, insights that were previously obscured by (or impossible to obtain with) existing characterization methods. Based on case studies with our new access methods, we discuss seven implications for computer architects in the cloud era and three methods for further enhancing hardware counters. Taken together, these observations have the potential to open up new avenues for architecture research.