A pipelined, shared resource MIMD computer
Advanced computer architecture
Concurrent programming: principles and practice
Concurrent programming: principles and practice
Compiler-generated communication for pipelined FPGA applications
Proceedings of the 40th annual Design Automation Conference
A Compile Time Based Approach for Solving Out-of-Order Communication in Kahn Process Networks
ASAP '02 Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors
Coarse-Grain Pipelining on Multiple FPGA Architectures
FCCM '02 Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Solving Out-of-Order Communication in Kahn Process Networks
Journal of VLSI Signal Processing Systems
Zero cost indexing for improved processor cache performance
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Hi-index | 0.00 |
In recent years, there has been increasing interest on using task-level pipelining to accelerate the overall execution of applications mainly consisting of Producer-Consumer tasks. This paper proposes an approach to achieve pipelining execution of Producer-Consumer pairs of tasks in FPGA-based multi-core architectures. Our approach is able to speedup the overall execution of successive, data-dependent tasks, by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated to the Producer-Consumer tasks. In order to improve performance, we propose a technique to optimize out-of-order Producer-Consumer pairs where the consumer uses more than once each data element produced, a behavior present in many applications (e.g., in image processing). All the schemes and optimizations proposed in this paper were evaluated with FPGA implementations. The experimental results show the feasibility of the approach in both in-order and out-of-order Producer-Consumer tasks. Furthermore, the results using our approach to task-level pipelining and a multi-core architecture reveal noticeable performance improvements for a number of benchmarks over a single core implementation without using task-level pipelining.