TL-DAE: thread-level decoupled access/execution for OpenMP on the Cyclops-64 many-core processor

  • Authors:
  • Ge Gan; Joseph Manzano

  • Affiliations:
  • Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.;Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.

  • Venue:
  • LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
  • Year:
  • 2009

Abstract

Cyclops-64 is a many-core processor with a software-managed memory hierarchy. OpenMP programs running on this processor frequently follow a three-step computing paradigm: (i) copy data into on-chip memory; (ii) perform computations on the chip; (iii) copy results back to off-chip memory. Hiding memory copy latency is therefore crucial to the performance of this paradigm. The traditional solution is asynchronous DMA transfer, but the Cyclops-64 processor does not support DMA. In this paper, we therefore propose a software solution called Thread-Level Decoupled Access/Execution (TL-DAE), a data-driven execution model for OpenMP programs running on the Cyclops-64 processor. The TL-DAE execution model is inspired by the canonical decoupled architecture. In our design, the OpenMP compiler decouples data movement from computation implicitly. At runtime, two groups of threads are spawned: computation threads, which execute computation code, and percolation threads, which execute data movement code. The two kinds of threads can slip with respect to each other, so a percolation thread can run ahead of a computation thread and fetch data for it. We develop the runtime techniques used to implement the TL-DAE execution model and propose the TL-DAE programming interface that the OpenMP compiler uses to generate the decoupled code. We evaluated the TL-DAE execution model with two OpenMP task benchmarks. Experimental results show significant performance enhancement.
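
The decoupling described in the abstract can be pictured with a small producer/consumer sketch. This is not the TL-DAE runtime or its programming interface, neither of which is given here; it is a minimal illustration in Python, assuming a bounded queue as a stand-in for a limited pool of on-chip buffers and a plain list as a stand-in for off-chip memory. The names `run_decoupled`, `percolation_thread`, `computation_thread`, `tile_size`, and `slots` are hypothetical.

```python
import queue
import threading


def run_decoupled(off_chip, tile_size, slots=2):
    """Sum the squares of each tile while a helper thread fetches tiles ahead.

    Hypothetical sketch: `off_chip` models off-chip memory, and the bounded
    queue models a small number of on-chip buffers.
    """
    on_chip = queue.Queue(maxsize=slots)  # bounded: only `slots` tiles fit "on chip"
    results = []

    def percolation_thread():
        # Access stream: copy tiles from "off-chip" into "on-chip" buffers.
        for start in range(0, len(off_chip), tile_size):
            tile = list(off_chip[start:start + tile_size])  # the copy-in step
            on_chip.put(tile)  # blocks while every buffer slot is full
        on_chip.put(None)  # sentinel: no more data will arrive

    def computation_thread():
        # Execute stream: consume tiles in order as they become available.
        while True:
            tile = on_chip.get()  # blocks until the next tile has been fetched
            if tile is None:
                break
            results.append(sum(x * x for x in tile))

    threads = [
        threading.Thread(target=percolation_thread),
        threading.Thread(target=computation_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_decoupled(list(range(10)), 4)` returns `[14, 126, 145]`. The queue bound is what lets the two threads slip relative to each other: the fetching thread may run up to `slots` tiles ahead, and it stalls only when no buffer is free. Copying results back to off-chip memory, step (iii) of the paradigm, is not modeled in this sketch.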