Latency Hiding and Performance Tuning with Graph-Based Execution

  • Authors:
  • Pietro Cicotti, Scott B. Baden

  • Venue:
  • DFM '11: Proceedings of the 2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing
  • Year:
  • 2011

Abstract

In current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realizing the potential of today's distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations, limiting overlap and the program's ability to adapt to communication delays.

In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency-tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e. task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies that improve performance.

Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard-coded MPI implementations.
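The task-dependency graph abstraction described in the abstract can be sketched in a few lines of C++. The sketch below is illustrative only: the names (`Task`, `TaskGraph`, `add_dependency`) are hypothetical and are not Tarragon's actual API. It shows the core idea the abstract contrasts with BSP: rather than a fixed linear sequence of phases, tasks become runnable as their predecessors complete, leaving a runtime free to reorder independent work.

```cpp
// Minimal sketch of a task-dependency graph, assuming hypothetical names;
// not Tarragon's actual API.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    std::function<void()> work;   // the computation this task performs
    std::vector<int> successors;  // tasks that depend on this one
    int unmet_deps = 0;           // predecessors not yet completed
};

class TaskGraph {
    std::vector<Task> tasks_;
public:
    int add_task(std::function<void()> work) {
        tasks_.push_back({std::move(work), {}, 0});
        return static_cast<int>(tasks_.size()) - 1;
    }

    // Declare that task `to` consumes data produced by task `from`
    // (task precedence, i.e. an edge in the dependency graph).
    void add_dependency(int from, int to) {
        tasks_[from].successors.push_back(to);
        ++tasks_[to].unmet_deps;
    }

    // Execute tasks in dependency order: a task enters the ready queue
    // as soon as its last predecessor finishes. This sketch runs ready
    // tasks one at a time on a single thread.
    void run() {
        std::queue<int> ready;
        for (int i = 0; i < static_cast<int>(tasks_.size()); ++i)
            if (tasks_[i].unmet_deps == 0) ready.push(i);
        while (!ready.empty()) {
            int t = ready.front(); ready.pop();
            tasks_[t].work();
            for (int s : tasks_[t].successors)
                if (--tasks_[s].unmet_deps == 0) ready.push(s);
        }
    }
};

int main() {
    TaskGraph g;
    // A structured-grid-style example: the interior update is independent
    // of the halo exchange, so it need not wait for communication.
    int recv  = g.add_task([] { std::puts("receive halo"); });
    int inner = g.add_task([] { std::puts("compute interior"); });
    int edge  = g.add_task([] { std::puts("compute boundary"); });
    g.add_dependency(recv, edge);  // only the boundary needs the halo
    (void)inner;
    g.run();
    return 0;
}
```

In this sequential sketch the ready queue drains one task at a time; in a dataflow runtime it would feed a pool of workers, so the interior computation could proceed while the halo exchange is still in flight. That dependency-driven overlap, rather than a hand-scheduled linear sequence of phases, is the latency-hiding mechanism the abstract describes.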