Hardware and software tradeoffs for task synchronization on manycore architectures

  • Authors:
  • Yonghong Yan; Sanjay Chatterjee; Daniel A. Orozco; Elkin Garcia; Zoran Budimlić; Jun Shirako; Robert S. Pavel; Guang R. Gao; Vivek Sarkar

  • Affiliations:
  • Department of Computer Science, Rice University; Department of Computer Science, Rice University; Department of Electrical Engineering, University of Delaware; Department of Electrical Engineering, University of Delaware; Department of Computer Science, Rice University; Department of Computer Science, Rice University; Department of Electrical Engineering, University of Delaware; Department of Electrical Engineering, University of Delaware; Department of Computer Science, Rice University

  • Venue:
  • Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
  • Year:
  • 2011

Abstract

Manycore architectures, with hundreds to thousands of cores per processor, are seen by many as a natural evolution of multicore processors. Taking advantage of this massive parallelism in practice requires a productive parallel programming model and an efficient runtime for scheduling and coordinating concurrent tasks. A critical prerequisite for an efficient runtime is a scalable synchronization mechanism that supports task coordination at different levels of granularity. This paper describes the implementation of a high-level synchronization construct called phasers on the IBM Cyclops64 manycore processor, and compares phasers to the lower-level synchronization primitives currently available to Cyclops64 programmers. Phasers support synchronization of dynamic tasks by allowing tasks to register and deregister with a phaser object. They unify point-to-point and collective synchronization behind easy-to-use interfaces, thereby offering productivity advantages over hardware primitives when used on manycores. We have experimented with several approaches to phaser implementation, using software, hardware, and a combination of both, to explore their portability and performance. The results show that a highly optimized phaser implementation delivers performance comparable to that obtained with lower-level synchronization primitives. We also demonstrate the success of the hardware optimizations proposed for phasers.
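
To make the register/deregister behavior described in the abstract concrete, here is a minimal sketch in Python. It is not the paper's Cyclops64 implementation: the class `SimplePhaser`, its method names, and the `run_demo` helper are hypothetical, chosen only for illustration, and the sketch covers only barrier-style (collective) synchronization on top of an ordinary lock and condition variable, not the point-to-point modes or any hardware support.

```python
import threading


class SimplePhaser:
    """Toy barrier-style phaser (illustrative only, names are assumptions)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._registered = 0  # tasks currently taking part
        self._arrived = 0     # tasks that have arrived in the current phase
        self.phase = 0

    def register(self):
        # Add one participant; done before the task starts synchronizing.
        with self._cond:
            self._registered += 1

    def deregister(self):
        # Remove one participant; this may complete the current phase
        # if everyone who remains has already arrived.
        with self._cond:
            self._registered -= 1
            self._advance_if_complete()

    def next(self):
        # Arrive at the current phase and block until it completes.
        with self._cond:
            my_phase = self.phase
            self._arrived += 1
            self._advance_if_complete()
            while self.phase == my_phase:
                self._cond.wait()
            return my_phase + 1

    def _advance_if_complete(self):
        # Caller must hold the lock.
        if self._registered > 0 and self._arrived == self._registered:
            self._arrived = 0
            self.phase += 1
            self._cond.notify_all()


def run_demo():
    """Three tasks synchronize for 1, 2, and 3 phases, then drop out."""
    phaser = SimplePhaser()
    rounds = {0: 1, 1: 2, 2: 3}
    observed = {task: [] for task in rounds}

    def worker(task):
        for _ in range(rounds[task]):
            observed[task].append(phaser.next())
        phaser.deregister()

    # Register every task up front so no phase can complete early.
    for _ in rounds:
        phaser.register()
    threads = [threading.Thread(target=worker, args=(t,)) for t in rounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return phaser.phase, observed


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the tasks that leave early do not stall the others: each deregistration shrinks the set of participants that the remaining tasks wait for, which is the dynamic-membership property that a fixed-size barrier lacks.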