Modern manycore processors feature a highly scalable, software-configurable cache hierarchy. To get good performance, manycore programmers must not only use the large number of cores efficiently but also understand and configure the cache hierarchy to suit the application. Task-based programming models can relieve programmers of this burden: programmers express parallelism as tasks, and an architecture-specific runtime system maps tasks to cores and also configures the cache hierarchy. In this paper, we focus on the cache hierarchy of the Tilera TILEPro64 processor, which features a software-configurable coherence waypoint called the home cache. We first show the performance bottleneck that arises when the runtime system schedules tasks without regard to home caches. We then demonstrate a technique in which the runtime system controls the assignment of home caches to memory blocks and schedules tasks to minimize home cache access penalties. Tests of our technique show a significant improvement in execution time on selected benchmarks. We conclude that, by taking processor architecture features into account, task-based programming models can provide continued performance and allow programmers to transition smoothly from the multicore to the manycore era.
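The idea of home-cache-aware scheduling can be illustrated with a small toy model. The sketch below is not the runtime system described in the abstract: the round-robin home assignment, the Manhattan hop count used as the access penalty, and the names `hops`, `assign_homes`, and `pick_tile` are all assumptions made for illustration. It relies only on the fact that the TILEPro64 arranges its 64 tiles in an 8x8 mesh.

```python
MESH_WIDTH = 8  # TILEPro64 tiles form an 8x8 grid


def hops(tile_a, tile_b):
    """Manhattan distance between two tile ids (assumed penalty model)."""
    ax, ay = tile_a % MESH_WIDTH, tile_a // MESH_WIDTH
    bx, by = tile_b % MESH_WIDTH, tile_b // MESH_WIDTH
    return abs(ax - bx) + abs(ay - by)


def assign_homes(num_blocks, num_tiles):
    """Hypothetical policy: spread the home caches of memory blocks
    round-robin over the tiles."""
    return {block: block % num_tiles for block in range(num_blocks)}


def pick_tile(task_blocks, home_of, idle_tiles):
    """Pick the idle tile with the lowest total hop count to the home
    caches of the blocks the task accesses (ties go to the lowest id)."""
    def penalty(tile):
        return sum(hops(tile, home_of[block]) for block in task_blocks)
    return min(sorted(idle_tiles), key=penalty)


if __name__ == "__main__":
    home_of = assign_homes(num_blocks=4, num_tiles=64)
    # A task touching blocks 0 and 1 is placed next to their home tiles.
    print(pick_tile([0, 1], home_of, idle_tiles={0, 63}))
```

A scheduler that ignores home caches would treat tiles 0 and 63 as interchangeable here; under the assumed penalty model, the choice between them differs by 26 hops in total for the two blocks.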