Managing Asynchronous Operations in Coarray Fortran 2.0

  • Authors:
  • Chaoran Yang, Karthik Murthy, John Mellor-Crummey


  • Venue:
  • IPDPS '13 Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
  • Year:
  • 2013

Abstract

As the gap between processor speed and network latency continues to grow, avoiding exposed communication latency is critical for high performance on modern supercomputers. One can hide communication latency by overlapping it with computation using non-blocking data transfers, or avoid exposing it by moving computation to the location of the data it manipulates. Coarray Fortran 2.0 (CAF 2.0), a partitioned global address space language, provides a rich set of asynchronous operations for avoiding exposed latency, including asynchronous copies, function shipping, and asynchronous collectives. CAF 2.0 provides event variables to manage completion of asynchronous operations that use explicit completion. This paper describes CAF 2.0's finish and cofence synchronization constructs, which manage implicit completion of asynchronous operations. Finish ensures global completion of a set of asynchronous operations across the members of a team. Because of CAF 2.0's SPMD model, its semantics and implementation of finish differ significantly from those of finish in X10 and Habanero-C. Cofence controls local data completion of implicitly-synchronized asynchronous operations. Together, these constructs make it possible to tune a program's performance by exploiting the differences among local data completion, local operation completion, and global completion of asynchronous operations, while hiding network latency. We explore subtle interactions between cofence, finish, events, asynchronous copies and collectives, and function shipping, and we justify their presence in a relaxed memory model for CAF 2.0. We demonstrate the utility of these constructs in the context of two benchmarks: Unbalanced Tree Search (UTS) and HPC Challenge Random Access. We achieve 74%-77% parallel efficiency for 4K-32K cores for UTS using the T1WL spec, demonstrating scalable performance with our synchronization constructs. Our cofence micro-benchmark shows that for a producer-consumer scenario, using local data completion rather than local operation completion yields superior performance.
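
The division of labor the abstract describes between cofence (local data completion) and finish (global completion) can be illustrated with a short CAF-2.0-style sketch of the producer-consumer scenario. The construct names (finish, cofence) come from the paper; the surrounding syntax and the helper routine `produce_next_batch` are illustrative pseudocode, not verified CAF 2.0 source.

```
! Illustrative sketch (pseudocode, not verified CAF 2.0 syntax):
! a producer repeatedly sends batches of work to a consumer image
! using implicitly-completed asynchronous operations.

finish
  ! Asynchronous copy: deliver the current batch to the
  ! neighbor's buffer; the put need not complete immediately.
  buffer(:)[neighbor] = local_work(:)

  ! cofence waits only for LOCAL DATA completion: once the data
  ! has left local_work, the producer may safely overwrite it,
  ! even though the transfer may not yet be remotely visible.
  cofence

  call produce_next_batch(local_work)   ! hypothetical helper
end finish
! At the end of finish, all asynchronous operations issued by
! every member of the team are globally complete.
```

Waiting only for local data completion at the cofence, rather than for local operation completion (let alone global completion), is what lets the producer reuse its buffer earliest; this is the performance difference the paper's cofence micro-benchmark measures.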