Phaser accumulators: A new reduction construct for dynamic parallelism

Authors:
J. Shirako;D. M. Peixotto;V. Sarkar;W. N. Scherer
Affiliations:
Department of Computer Science, Rice University, USA;Department of Computer Science, Rice University, USA;Department of Computer Science, Rice University, USA;Department of Computer Science, Rice University, USA
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Abstract

A reduction is a computation in which a common operation, such as a sum, is to be performed across multiple pieces of data, each supplied by a separate task. We introduce phaser accumulators, a new reduction construct that meshes seamlessly with phasers to support dynamic parallelism in a phased (iterative) setting. By separating reduction computations into the parts of sending data, performing the computation itself, and retrieving the result, we enable overlap of communication and computation in a manner analogous to that of split-phase barriers. Additionally, this separation enables exploration of implementation strategies that differ as to when the reduction itself is performed: eagerly when the data is supplied, or lazily when a synchronization point is reached. We implement accumulators as extensions to phasers in the Habanero dialect of the X10 programming language. Performance evaluations of the EPCC Syncbench, Spectral-norm, and CG benchmarks on AMD Opteron, Intel Xeon, and Sun UltraSPARC T2 multicore SMPs show superior performance and scalability over OpenMP reductions (on two platforms) and X10 code (on three platforms) written with atomic blocks, with improvements of up to 2.5× on the Opteron and 14.9× on the UltraSPARC T2 relative to OpenMP and 16.5× on the Opteron, 26.3× on the Xeon and 94.8× on the UltraSPARC T2 relative to X10 atomic blocks. To the best of our knowledge, no prior reduction construct supports the dynamic parallelism and asynchronous capabilities of phaser accumulators.
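
The three-part split described in the abstract (send, reduce, retrieve) and the eager versus lazy choice can be illustrated with a minimal sketch in plain Java. This is not the Habanero API: it uses the standard `java.util.concurrent.Phaser` as a stand-in for Habanero phasers, and the names `SumAccumulator`, `send`, `signal`, `get`, and `AccumulatorDemo` are hypothetical, chosen only for this illustration. It also assumes a fixed number of tasks, so it does not show dynamic registration.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch of a sum accumulator tied to a standard Java Phaser. */
final class SumAccumulator {
    private final boolean eager;                          // true: reduce on send; false: reduce at phase end
    private final AtomicLong running = new AtomicLong();  // eager mode: shared running sum
    private final AtomicLongArray slots;                  // lazy mode: one slot per task
    private volatile long result;                         // value published when a phase completes
    private final Phaser phaser;

    SumAccumulator(int tasks, boolean eager) {
        this.eager = eager;
        this.slots = new AtomicLongArray(tasks);
        this.phaser = new Phaser(tasks) {
            @Override
            protected boolean onAdvance(int phase, int registeredParties) {
                // Runs once per phase, in the last arriving task, before anyone is released.
                result = reduce();
                return registeredParties == 0;
            }
        };
    }

    /** Collects the phase total and resets the state for the next phase. */
    private long reduce() {
        if (eager) {
            return running.getAndSet(0);
        }
        long sum = 0;
        for (int i = 0; i < slots.length(); i++) {
            sum += slots.getAndSet(i, 0);
        }
        return sum;
    }

    /** Part 1: supply data. Eager mode combines immediately; lazy mode only stores. */
    void send(int taskId, long value) {
        if (eager) {
            running.addAndGet(value);
        } else {
            slots.addAndGet(taskId, value);
        }
    }

    /** Non-blocking arrival, as in a split-phase barrier. Returns the phase arrived at. */
    int signal() {
        return phaser.arrive();
    }

    /** Part 3: wait for the given phase to complete, then read its result. */
    long get(int phase) {
        phaser.awaitAdvance(phase);
        return result;
    }
}

final class AccumulatorDemo {
    /** Runs the given number of tasks and phases; returns the total seen by task 0 in each phase. */
    static long[] run(int tasks, int phases, boolean eager) {
        SumAccumulator acc = new SumAccumulator(tasks, eager);
        long[] seen = new long[phases];
        Thread[] threads = new Thread[tasks];
        for (int t = 0; t < tasks; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int p = 0; p < phases; p++) {
                    acc.send(id, (long) (id + 1) * (p + 1));
                    int phase = acc.signal();
                    // Task-local work that does not need the result could overlap here.
                    long total = acc.get(phase);
                    if (id == 0) {
                        seen[p] = total;
                    }
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }
}
```

In this sketch, each task must call `get` for a phase before sending data for the next one; otherwise a value could be counted in the wrong phase. Both strategies return the same totals. They differ in where the work happens: eager mode contends on one shared counter at every `send`, while lazy mode defers the combining step to the last task that arrives at the synchronization point.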