Today's microprocessors have complex memory subsystems with several cache levels. Efficient use of this memory hierarchy is crucial for achieving good performance, especially on multicore processors. Unfortunately, many implementation details of these processors are not publicly available. In this paper we present such fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, QuickPath Interconnect (QPI), and ccNUMA architecture. Our analysis is based on sophisticated benchmarks that measure the latency and bandwidth between different locations in the memory subsystem. Special care is taken to control the coherency state of the data in order to gain insight into performance-relevant implementation details of the cache coherency protocol. Based on these benchmarks we present undocumented performance data and architectural properties.
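As a rough illustration of how a memory latency measurement of this kind can be structured, the following is a minimal pointer-chasing sketch in C. It is not the benchmark used in the paper: the function names (`build_chase`, `chase`, `ns_per_load`) are hypothetical, the code assumes a POSIX system that provides `clock_gettime`, and it does not pin threads, place memory on a specific NUMA node, or control the coherency state of the data, all of which the analysis described above would require.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime; assumes a POSIX system */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with a random single-cycle permutation (Sattolo's algorithm),
 * so that following next[] from any index visits all n slots before it
 * returns to the start. The random order is meant to defeat simple
 * hardware prefetchers. Note: rand() may have a small range on some
 * platforms, which reduces randomness but still yields a single cycle. */
static void build_chase(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;  /* j in [0, i-1], never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` loads. Each load address depends on the
 * result of the previous load, so the loads cannot overlap and the time
 * per step approximates the access latency. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}

/* Average time per dependent load in nanoseconds (hypothetical helper). */
static double ns_per_load(const size_t *next, size_t steps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t sink = chase(next, 0, steps);  /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

A caller would allocate an array of `n` entries, call `build_chase` once, and then call `ns_per_load` for several array sizes; sizes that fit in a given cache level should, under these assumptions, show lower per-load times than sizes that spill to the next level or to main memory. As a simplification, this sketch stores several indices per cache line; a more careful version would place one chain element per cache line.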