Page placement algorithms for large real-indexed caches
ACM Transactions on Computer Systems (TOCS)
Optimization of a Computational Fluid Dynamics Code for the Memory Hierarchy: A Case Study
International Journal of High Performance Computing Applications
Directly characterizing cross core interference through contention synthesis
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Memory system performance in a NUMA multicore multiprocessor
Proceedings of the 4th Annual International Conference on Systems and Storage
The impact of memory subsystem resource sharing on datacenter applications
Proceedings of the 38th Annual International Symposium on Computer Architecture
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
A case for NUMA-aware contention management on multicore systems
USENIX ATC '11: Proceedings of the 2011 USENIX Annual Technical Conference
Understanding stencil code performance on multicore architectures
Proceedings of the 8th ACM International Conference on Computing Frontiers
Overseer: low-level hardware monitoring and management for Java
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Matching memory access patterns and data placement for NUMA systems
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Buffer-on-board memory systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Detection of false sharing using machine learning
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
L1-bandwidth aware thread allocation in multicore SMT processors
PACT '13: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling
Future Generation Computer Systems
Nowadays, all major processors provide a set of performance counters that capture micro-architectural information such as the number of elapsed cycles, cache misses, or instructions executed. Counters can be found in processor cores, on the processor die, in chipsets, and in I/O cards. They provide a wealth of information about how the hardware is being used by software. Many processors now support events that measure, precisely and with very limited overhead, the traffic between a core and the memory subsystem, making it possible to compute average load latency and bus bandwidth utilization. This valuable information can be used to improve code quality and to place threads so as to maximize hardware utilization. We postulate that performance counters are the key hardware resource for locating and understanding issues related to the memory subsystem. In this paper we illustrate our position by showing how certain key memory performance metrics can be gathered easily on today's hardware.
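As a sketch of how such raw counts translate into the metrics mentioned above, the snippet below derives bus bandwidth and average load latency from hypothetical counter readings. The event names and values are assumptions for illustration; the actual events (e.g. occupancy-style miss counters or off-core fill counters) and their encodings vary by processor model.

```python
# Sketch: turning raw performance-counter values into memory metrics.
# All readings below are hypothetical; real event names and encodings
# depend on the specific CPU's performance monitoring unit.

CACHE_LINE_BYTES = 64  # typical cache-line size on x86 processors

def bus_bandwidth(line_fills, elapsed_seconds):
    """Bytes/s moved over the memory bus, from cache-line fill counts."""
    return line_fills * CACHE_LINE_BYTES / elapsed_seconds

def avg_load_latency(miss_outstanding_cycles, completed_loads):
    """Average cycles per load, from an occupancy counter divided by
    the number of completed load instructions."""
    return miss_outstanding_cycles / completed_loads

# Hypothetical readings over a 1-second sampling window:
fills = 31_250_000                              # cache lines filled from memory
bw = bus_bandwidth(fills, 1.0)                  # 31_250_000 * 64 = 2.0e9 B/s
lat = avg_load_latency(450_000_000, 3_000_000)  # 450e6 / 3e6 = 150 cycles

print(f"bandwidth: {bw / 1e9:.1f} GB/s, avg load latency: {lat:.0f} cycles")
```

On Linux, the raw counts themselves would typically be collected with the `perf_event_open` system call or the `perf stat` tool; the arithmetic above is the same regardless of the collection mechanism.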