Balancing Locality and Parallelism on Shared-cache Mulit-core Systems

Authors:
Michael Jason Cade;Apan Qasem
Affiliations:
-;-
Venue:
HPCC '09 Proceedings of the 2009 11th IEEE International Conference on High Performance Computing and Communications
Year:
2009

Citing 0
Cited 3

Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Meeting midway: improving CMP performance with memory-side prefetching

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance enhancing software is far from adequate in regards to the exploitation of hardware features on this complex new architecture. As a result, much of the performance capabilities of multi-core systems are yet to be realized. This research addresses one facet of this problem by exploring the relationship between data-locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. A model is presented for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer relationship. Experimental results suggest that consideration of the synchronization window, or the amount of work individual threads can be allowed to do between synchronizations, allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last-level of cache that is shared among processing units. By considering these factors, the calculation of the optimum synchronization window incorporates parallelism and data locality issues for maximum performance.