IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
FPGA-based computing engines have become a promising option for the implementation of computationally intensive applications due to high flexibility and parallelism. However, one of the main obstacles to overcome when trying to accelerate an application on an FPGA is the bottleneck in off-chip communication, typically to large memories. Often it is known at compile-time that the same data item is accessed many times, and as a result can be loaded once from large off-chip RAM onto scarce on-chip RAM, alleviating this bottleneck. This paper addresses how to automatically derive an address mapping that reduces the size of the required on-chip memory for a given memory access pattern. Experimental results demonstrate that, in practice, our approach reduces on-chip storage requirements to the minimum, corresponding to a reduction in on-chip memory size of up to 40× (average 10×) for some benchmarks compared to a naive approach. At the same time, no clock period penalty or increase in control logic area compared to this approach is observed for these benchmarks.
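To make the idea of a size-reducing address mapping concrete, the following is a minimal Python sketch. It is not the derivation used in the paper, which the abstract does not describe. It assumes a hypothetical 3-tap sliding-window access pattern and a simple modulo mapping, and all function names are illustrative. It searches for the smallest on-chip buffer size `m` such that mapping off-chip address `a` to on-chip slot `a % m` never places two simultaneously live data items in the same slot.

```python
def live_intervals(trace):
    """Map each address to (first, last) iteration in which it is accessed."""
    intervals = {}
    for it, addresses in enumerate(trace):
        for a in addresses:
            first, _ = intervals.get(a, (it, it))
            intervals[a] = (first, it)
    return intervals


def overlaps(x, y):
    """True if two inclusive iteration intervals share at least one iteration."""
    return x[0] <= y[1] and y[0] <= x[1]


def smallest_conflict_free_modulus(intervals):
    """Smallest m such that slot = address % m never holds two live items at once."""
    addrs = sorted(intervals)
    for m in range(1, len(addrs) + 1):
        conflict = any(
            a % m == b % m and overlaps(intervals[a], intervals[b])
            for i, a in enumerate(addrs)
            for b in addrs[i + 1:]
        )
        if not conflict:
            return m
    return len(addrs)


# Assumed example pattern: iteration i reads a[i], a[i+1], a[i+2].
N = 8
trace = [[i, i + 1, i + 2] for i in range(N)]
intervals = live_intervals(trace)

naive_size = len(intervals)  # one on-chip slot per distinct address
reduced_size = smallest_conflict_free_modulus(intervals)
print(naive_size, reduced_size)
```

For this assumed pattern the naive buffer needs one slot per distinct address (10 slots for `N = 8`), while the modulo mapping needs only 3, which equals the maximum number of items live in any single iteration. The brute-force search is only meant to illustrate the trade-off; a compile-time flow would derive the mapping analytically from the access pattern.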