A Simulation Study of Decoupled Architecture Computers
IEEE Transactions on Computers
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Advanced compiler design and implementation
Advanced compiler design and implementation
Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Toward a Software Infrastructure for the Cyclops-64 Cellular Architecture
HPCS '06 Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Orchestrating data transfer for the cell/B.E. processor
Proceedings of the 22nd annual international conference on Supercomputing
Scheduling multithreaded computations by work stealing
SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science
IEEE Transactions on Parallel and Distributed Systems
DBDB: optimizing DMATransfer for the cell be architecture
Proceedings of the 23rd international conference on Supercomputing
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Optimizing the use of static buffers for DMA on a CELL chip
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Optimization of dense matrix multiplication on IBM cyclops-64: challenges and experiences
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A compiler-based approach for dynamically managing scratch-pad memories in embedded systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Hi-index | 0.00 |
Cyclops-64 is a many-core processor with software managed memory hierarchy. For OpenMP programs running on this processor, a frequently used computing paradigm is: (i) copy data into on-chip memory; (ii) perform computations on the chip; (iii) copy results back to off-chip memory. Obviously, hiding memory copy latency is very crucial to the performance of this computing paradigm. The traditional solution is to use the asynchronous DMA transfer. However, DMA is not supported in the Cyclops-64 processor. Therefore, in this paper, we propose a software solution, called Thread-Level Decoupled Access/Execution (TL-DAE for short). It is a data-driven execution model for OpenMP programs running on the Cyclops-64 processor. The TL-DAE execution model is inspired by the canonical decoupled architecture. In our design, data movements and computations are decoupled implicitly by OpenMP compiler. At runtime, two different groups of threads are spawned: the computation threads and the percolation threads. Computation threads execute computation code while percolation threads execute data movement code. The execution of computation thread and percolation thread can slip with respect to each other, so percolation thread can run further ahead than computation thread and fetch data for it. In this paper, we will not only develop the runtime techniques used to implement the TL-DAE execution model, but also propose the required TL-DAE programming interface that is used by OpenMP compiler to generate the decoupled code. We have evaluated the TL-DAE execution model by using two OpenMP task benchmarks. Experimental results show significant performance enhancement.