Some data- and compute-intensive applications can be accelerated by offloading portions of their code to platforms such as GPGPUs or FPGAs. However, to achieve high performance for these kernels, it is mandatory to restructure the application, to generate adequate communication mechanisms for the transfer of remote data, and to make good use of the memory bandwidth. In the context of the high-level synthesis (HLS), from a C program, of hardware accelerators on FPGA, we show how to automatically generate optimized remote accesses for an accelerator communicating with an external DDR memory. Loop tiling is used to enable block communications, which are well suited to DDR memories. Pipelined communication processes are generated to overlap communications and computations, thereby hiding some latencies, in a way similar to double buffering. Finally, not only intra-tile but also inter-tile data reuse is exploited to avoid remote accesses when data are already available in the local memory. Our first contribution is to show how to generate the sets of data to be read from (resp. written to) the external memory just before (resp. after) each tile so as to reduce communications and reuse data as much as possible in the accelerator. The main difficulty arises when some data may be (re)defined in the accelerator and must be kept locally. Our second contribution is an optimized code generation scheme, entirely at source level, i.e., in C, that allows us to compile all the necessary glue (the communication processes) with the same HLS tool as the computation kernel. Both contributions rely on advanced polyhedral techniques for program analysis and transformation. Experiments with Altera HLS tools demonstrate how our techniques can be used to efficiently map C kernels to FPGA.
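To make the tiling and double-buffering idea concrete, here is a minimal C sketch of the pattern the abstract describes: the iteration space is tiled, each tile's data is fetched from "DDR" as one block transfer, and the fetch of tile t+1 is issued alongside the computation of tile t using two alternating local buffers. All names (`ddr`, `local_buf`, `load_tile`, the doubling kernel, and the sizes) are hypothetical stand-ins, not the paper's actual generated code; in real HLS output the load and compute processes would run concurrently, whereas this software model executes them sequentially.

```c
#include <string.h>

#define N    1024   /* total data size (assumption) */
#define TILE 256    /* tile size (assumption) */

static int ddr[N];               /* models the external DDR memory */
static int local_buf[2][TILE];   /* double buffer in local (on-chip) memory */
static int out[N];               /* results written back */

/* Block communication: copy one whole tile from DDR into a local buffer. */
static void load_tile(int *dst, int tile) {
    memcpy(dst, &ddr[tile * TILE], TILE * sizeof(int));
}

/* Placeholder computation kernel operating only on local data. */
static void compute_tile(const int *src, int tile) {
    for (int i = 0; i < TILE; i++)
        out[tile * TILE + i] = src[i] * 2;
}

void run(void) {
    load_tile(local_buf[0], 0);           /* prologue: prefetch first tile */
    for (int t = 0; t < N / TILE; t++) {
        /* Issue the load of the next tile into the other buffer; in an
           HLS pipeline this overlaps with the computation below. */
        if (t + 1 < N / TILE)
            load_tile(local_buf[(t + 1) & 1], t + 1);
        compute_tile(local_buf[t & 1], t);
    }
}
```

The paper's scheme goes further than this sketch: it also computes, with polyhedral analysis, exactly which elements must be loaded before and stored after each tile, so that data reused across tiles (or redefined in the accelerator) is not transferred again.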