Page placement algorithms for large real-indexed caches
ACM Transactions on Computer Systems (TOCS)
Optimization of a Computational Fluid Dynamics Code for the Memory Hierarchy: A Case Study
International Journal of High Performance Computing Applications
Directly characterizing cross core interference through contention synthesis
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Memory system performance in a NUMA multicore multiprocessor
Proceedings of the 4th Annual International Conference on Systems and Storage
The impact of memory subsystem resource sharing on datacenter applications
Proceedings of the 38th Annual International Symposium on Computer Architecture
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
A case for NUMA-aware contention management on multicore systems
USENIX ATC '11: Proceedings of the 2011 USENIX Annual Technical Conference
Understanding stencil code performance on multicore architectures
Proceedings of the 8th ACM International Conference on Computing Frontiers
Overseer: low-level hardware monitoring and management for Java
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Matching memory access patterns and data placement for NUMA systems
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Buffer-on-board memory systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Detection of false sharing using machine learning
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
L1-bandwidth aware thread allocation in multicore SMT processors
PACT '13: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling
Future Generation Computer Systems
Nowadays, all major processors provide a set of performance counters that capture micro-architectural information such as the number of elapsed cycles, cache misses, or instructions executed. Counters can be found in processor cores, on the processor die, in chipsets, and in I/O cards. They provide a wealth of information about how the hardware is being used by software. Many processors now support events that measure, precisely and with very limited overhead, the traffic between a core and the memory subsystem, making it possible to compute average load latency and bus bandwidth utilization. This valuable information can be used to improve code quality and to place threads so as to maximize hardware utilization. We postulate that performance counters are the key hardware resource for locating and understanding issues related to the memory subsystem. In this paper we illustrate our position by showing how certain key memory performance metrics can be gathered easily on today's hardware.
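As a sketch of how such raw counts translate into the metrics mentioned above, the snippet below derives bus bandwidth and average load latency from hypothetical counter readings. The event names and values are assumptions for illustration; the actual events (e.g. occupancy-style miss counters or off-core fill counters) and their encodings vary by processor model.

```python
# Sketch: turning raw performance-counter values into memory metrics.
# All readings below are hypothetical; real event names and encodings
# depend on the specific CPU's performance monitoring unit.

CACHE_LINE_BYTES = 64  # typical cache-line size on x86 processors

def bus_bandwidth(line_fills, elapsed_seconds):
    """Bytes/s moved over the memory bus, from cache-line fill counts."""
    return line_fills * CACHE_LINE_BYTES / elapsed_seconds

def avg_load_latency(miss_outstanding_cycles, completed_loads):
    """Average cycles per load, from an occupancy counter divided by
    the number of completed load instructions."""
    return miss_outstanding_cycles / completed_loads

# Hypothetical readings over a 1-second sampling window:
fills = 31_250_000                              # cache lines filled from memory
bw = bus_bandwidth(fills, 1.0)                  # 31_250_000 * 64 = 2.0e9 B/s
lat = avg_load_latency(450_000_000, 3_000_000)  # 450e6 / 3e6 = 150 cycles

print(f"bandwidth: {bw / 1e9:.1f} GB/s, avg load latency: {lat:.0f} cycles")
```

On Linux, the raw counts themselves would typically be collected with the `perf_event_open` system call or the `perf stat` tool; the arithmetic above is the same regardless of the collection mechanism.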