TL-DAE: thread-level decoupled access/execution for OpenMP on the Cyclops-64 many-core processor

Authors:
Ge Gan; Joseph Manzano
Affiliations:
Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.; Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.
Venue:
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Year:
2009

Abstract

Cyclops-64 is a many-core processor with a software-managed memory hierarchy. OpenMP programs running on this processor frequently follow a three-step computing paradigm: (i) copy data into on-chip memory; (ii) perform computations on the chip; (iii) copy results back to off-chip memory. Because every computation waits on these copies, hiding memory-copy latency is crucial to the performance of this paradigm. The traditional solution is asynchronous DMA transfer, but DMA is not supported on the Cyclops-64 processor. In this paper, we therefore propose a software solution called Thread-Level Decoupled Access/Execution (TL-DAE), a data-driven execution model for OpenMP programs running on the Cyclops-64 processor. The TL-DAE execution model is inspired by the canonical decoupled architecture. In our design, data movements and computations are decoupled implicitly by the OpenMP compiler. At runtime, two groups of threads are spawned: computation threads, which execute the computation code, and percolation threads, which execute the data movement code. The two kinds of threads can slip with respect to each other, so a percolation thread can run ahead of a computation thread and fetch data for it. We develop the runtime techniques used to implement the TL-DAE execution model and propose the TL-DAE programming interface that the OpenMP compiler uses to generate the decoupled code. We have evaluated the TL-DAE execution model using two OpenMP task benchmarks. Experimental results show a significant performance improvement.
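The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the TL-DAE runtime or its programming interface, neither of which is specified here. It assumes one percolation thread and one computation thread, models on-chip buffers with a bounded queue, and omits step (iii), copying results back. The names `run_decoupled`, `off_chip`, `tile_size`, and `slip` are hypothetical and chosen only for illustration.

```python
import queue
import threading


def run_decoupled(off_chip, tile_size, slip):
    """Sum each tile of `off_chip` with one percolation and one computation thread.

    Illustrative only: `slip` (assumed >= 1) bounds how many tiles the
    percolation thread may fetch ahead of the computation thread.
    """
    # The bounded queue stands in for a fixed number of on-chip buffers.
    ready = queue.Queue(maxsize=slip)
    results = []

    def percolation():
        # Access side: copy tiles from "off-chip" data into "on-chip" buffers.
        for start in range(0, len(off_chip), tile_size):
            tile = list(off_chip[start:start + tile_size])  # the memory copy
            ready.put(tile)  # blocks once `slip` tiles are waiting
        ready.put(None)  # sentinel: no more tiles

    def computation():
        # Execute side: data-driven, proceeds only when a tile has arrived.
        while True:
            tile = ready.get()
            if tile is None:
                break
            results.append(sum(tile))

    threads = [
        threading.Thread(target=percolation),
        threading.Thread(target=computation),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_decoupled(list(range(10)), tile_size=4, slip=2)` lets the percolation thread stay up to two tiles ahead while the computation thread consumes tiles in order. A larger `slip` hides more copy latency at the cost of more buffer space, which is the trade-off a bounded on-chip memory imposes.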