Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

  • Authors:
  • Luís Fabrício Góes;Christiane Pousa Ribeiro;Márcio Castro;Jean-François Méhaut;Murray Cole;Marcelo Cintra

  • Affiliations:
  • PPGEE, GSDC Group, Pontifícia Universidade Católica de Minas Gerais, Belo Horizonte, Brazil (Góes); INRIA, CEA, LIG Laboratory, Grenoble University, Grenoble, France (Ribeiro, Castro, Méhaut); School of Informatics, ICSA, CARD Group, University of Edinburgh, Edinburgh, UK (Cole, Cintra)

  • Venue:
  • International Journal of Parallel Programming
  • Year:
  • 2014

Abstract

Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation, and cache prefetching are commonly employed to enhance memory affinity by keeping data close to the cores that access it. Software transactional memory (STM) applications, in particular, exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism that improves memory affinity in STM applications fitting the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then, it employs data-prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern, providing automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
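As a rough illustration of the two-level idea in the abstract, the sketch below shows a worklist whose backing pages are interleaved across NUMA nodes with libnuma (one possible DRAM-level policy), plus a helper thread that uses __builtin_prefetch to keep upcoming items cache-resident. This is not the authors' framework; names such as worklist_t and helper_main, the fixed prefetch distance, and the hard-coded interleave policy are illustrative assumptions.

```c
/*
 * Minimal sketch, assuming libnuma and pthreads (gcc -O2 sketch.c -lnuma -lpthread).
 * Not the paper's framework: worklist_t, helper_main, and the interleave
 * policy are hypothetical stand-ins for the skeleton's automatic choices.
 */
#include <numa.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N_ITEMS (1 << 20)
#define PREFETCH_DIST 8           /* how far ahead the helper runs */

typedef struct {
    long *items;                  /* work items, NUMA-interleaved */
    atomic_long next;             /* index of the next item to grab */
} worklist_t;

/* Helper thread: stays PREFETCH_DIST items ahead of the worker and
 * touches upcoming data so it is cache-resident when dequeued. */
static void *helper_main(void *arg) {
    worklist_t *wl = arg;
    long i;
    while ((i = atomic_load(&wl->next)) < N_ITEMS) {
        long ahead = i + PREFETCH_DIST;
        if (ahead < N_ITEMS)
            __builtin_prefetch(&wl->items[ahead], 0 /* read */, 1);
    }
    return NULL;
}

/* Worker: repeatedly grabs the next item; in the paper's setting this
 * body would run inside an STM transaction. */
static void *worker_main(void *arg) {
    worklist_t *wl = arg;
    long i, sum = 0;
    while ((i = atomic_fetch_add(&wl->next, 1)) < N_ITEMS)
        sum += wl->items[i];      /* stand-in for real transactional work */
    return (void *)sum;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    worklist_t wl = { .items = NULL, .next = 0 };
    /* DRAM-level affinity: interleave the worklist's pages across all
     * NUMA nodes (one of several policies a skeleton could select). */
    wl.items = numa_alloc_interleaved(N_ITEMS * sizeof(long));
    for (long i = 0; i < N_ITEMS; i++) wl.items[i] = i;

    pthread_t helper, worker;
    pthread_create(&helper, NULL, helper_main, &wl);
    pthread_create(&worker, NULL, worker_main, &wl);
    pthread_join(worker, NULL);
    pthread_join(helper, NULL);
    numa_free(wl.items, N_ITEMS * sizeof(long));
    return 0;
}
```

In the paper's setting, the page allocation policy would be selected automatically per application by the skeleton framework rather than hard-coded, and the worker body would execute as an STM transaction over shared worklist data.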