Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity in STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then, it employs data-prefetching helper threads to improve affinity at the cache level. The mechanism relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism can achieve performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
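To make the cache-level idea concrete, the following is a minimal toy sketch of why a skeleton that owns the worklist loop is in a good position to prefetch: it can see which items will run next. This is not the paper's implementation. The names `run_worklist`, `ToyCache`, `demo` and the `lookahead` parameter are hypothetical, the "cache" is just a Python set, and the prefetch hook runs inline rather than on a concurrent helper thread. The DRAM-level page allocation policy is not modeled at all.

```python
from collections import deque


def run_worklist(items, process, prefetch=None, lookahead=2):
    """Drain a worklist; `process` may return new items to append.

    `prefetch` is a hypothetical hook called on the next `lookahead` pending
    items before each step. A real helper thread would run concurrently;
    here the hook runs inline to keep the sketch deterministic.
    """
    worklist = deque(items)
    processed = []
    while worklist:
        if prefetch is not None:
            # The skeleton owns the queue, so it knows what will run next.
            for i in range(min(lookahead, len(worklist))):
                prefetch(worklist[i])
        item = worklist.popleft()
        new_items = process(item) or ()
        worklist.extend(new_items)
        processed.append(item)
    return processed


class ToyCache:
    """Stand-in for a hardware cache: tracks which keys are 'warm'."""

    def __init__(self):
        self.warm = set()
        self.hits = 0
        self.misses = 0

    def prefetch(self, key):
        self.warm.add(key)

    def access(self, key):
        if key in self.warm:
            self.hits += 1
        else:
            self.misses += 1
            self.warm.add(key)


def demo(use_prefetch):
    """Run the same toy workload with or without the prefetch hook."""
    cache = ToyCache()

    def process(item):
        cache.access(item)
        # Items below 3 spawn one follow-up item, as worklist algorithms do.
        return [item + 10] if item < 3 else []

    order = run_worklist(
        [0, 1, 2, 3],
        process,
        prefetch=cache.prefetch if use_prefetch else None,
    )
    return order, cache.hits, cache.misses
```

Calling `demo(False)` and `demo(True)` processes the items in the same order; only the hit and miss counts of the toy cache differ. In a real system the benefit would depend on timing, since a helper thread that runs too far ahead or too late gains nothing, and this sketch does not capture that.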