Software-Pipelining on Multi-Core Architectures

Authors:
Alban Douillet;Guang R. Gao
Affiliations:
Hewlett-Packard, USA;University of Delaware, USA
Venue:
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Year:
2007

Citing 0
Cited 5

Code-size conscious pipelining of imperfectly nested loops
MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture

Register allocation for software pipelined multidimensional loops
ACM Transactions on Programming Languages and Systems (TOPLAS)

Optimizing large scale chemical transport models for multicore platforms
Proceedings of the 2008 Spring simulation multiconference

Input-driven dynamic execution prediction of streaming applications
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pipelining for cyclic control systems
Proceedings of the 16th international conference on Hybrid systems: computation and control

Abstract

It is becoming increasingly evident that multi-core chip architectures are emerging as a solution to efficiently amortizing the ever-growing number of transistors on a chip. However, the success of such multi-core chips depends on advances in system software technology, such as compilers and run-time systems, so that application programs can extract thread-level parallelism from originally single-threaded applications and fully utilize the on-chip hardware concurrency. In this paper, we propose a method which, from a parallel or non-parallel imperfect loop nest written in a standard sequential language such as C or Fortran, automatically generates a multi-threaded software-pipelined schedule for multi-core architectures. The generated schedule already contains all the necessary synchronization instructions and is guaranteed to be free of deadlocks and buffer overflows. The feasibility of the proposed method within a modern compiler infrastructure has been verified through a pilot implementation in the Open64 compiler, which was tested on the IBM Cyclops multi-core architecture. Experimental results show that the performance exhibits good scalability even with 100 cores. Our light-weight synchronization mechanism minimizes dependency stalls and synchronization overheads without the use of dedicated hardware support.
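
For readers unfamiliar with the term, the kind of input the abstract calls an imperfect loop nest can be sketched in C as follows. This is only an illustration: the function name matvec_imperfect and the matrix-vector computation are hypothetical choices, not taken from the paper, and the sketch shows the shape of a sequential input loop, not the generated multi-threaded schedule.

```c
#include <stddef.h>

/* Hypothetical example (not from the paper): y = A * x for a row-major
 * n-by-m matrix. The nest is "imperfect" because statements sit between
 * the two loop headers and after the inner loop, rather than only in
 * the innermost loop body. */
void matvec_imperfect(size_t n, size_t m, const double *a,
                      const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;               /* outer-level statement before the inner loop */
        for (size_t j = 0; j < m; j++) {
            acc += a[i * m + j] * x[j]; /* innermost body; acc is carried across j */
        }
        y[i] = acc;                     /* outer-level statement after the inner loop */
    }
}
```

In this sketch the outer iterations are independent of one another, while the inner loop carries a dependence through acc, so a single nest can mix parallel and non-parallel loop levels.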