Reducing Hot-Spot Contention in Shared-Memory Multiprocessor Systems
IEEE Concurrency
Destination-Based HoL Blocking Elimination
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
DataStager: scalable data staging services for petascale applications
Proceedings of the 18th ACM international symposium on High performance distributed computing
DataStager: scalable data staging services for petascale applications
Cluster Computing
Dynamic evolution of congestion trees: analysis and impact on switch architecture
HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
Hi-index | 0.00 |
Memory contention can be a major source of overhead in large-scale shared-memory multiprocessors. Although there are many hardware solutions to the problem of memory contention, these solutions are often complex and expensive, so software solutions are an attractive alternative. This paper evaluates one particular software solution, called block-column allocation, which is very effective at reducing memory contention for a large class of SPMD (Single-Program-Multiple-Data) programs, and can be implemented easily by the compiler. We first quantify the impact of memory contention on performance by simulating the execution of several application kernels on a large-scale multiprocessor. Our simulation results confirm that memory contention is widespread on large-scale machines; our applications suggest that contention is usually caused by synchronized access to a range of addresses (rather than to a single address). We show that block-column allocation, where each range of addresses is divided into cache lines, and each cache line is allocated to a separate memory module, can nearly eliminate this source of memory contention. As our main contribution, we compare block-column allocation to row-major allocation (a common data allocation scheme) and logarithmic broadcasting (the standard software technique for alleviating memory contention). Our analysis demonstrates the clear superiority of block-column allocation over row-major allocation in the presence of memory contention. Our analysis also indicates that the choice between block-column allocation and logarithmic broadcasting is less clear, as it depends both on the type of synchronization used and the number of processors. We can conclude, however, that on large-scale machines with hundreds of processors, block-column allocation and lock-based synchronization is the most effective combination for reducing memory contention in SPMD matrix computations.