Programmable active memories: reconfigurable systems come of age
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
General-Purpose Systolic Arrays
Computer
Computational RAM: Implementing Processors in Memory
IEEE Design & Test
IEEE Micro
Intelligent RAM (IRAM): the Industrial Setting, Applications, and Architectures
ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
An FPGA implementation of the two-dimensional finite-difference time-domain (FDTD) algorithm
FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
FPGA-Based Acceleration of the 3D Finite-Difference Time-Domain Method
FCCM '04 Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Towards an RCC-based accelerator for computational fluid dynamics applications
The Journal of Supercomputing
Maxwell - a 64 FPGA Supercomputer
AHS '07 Proceedings of the Second NASA/ESA Conference on Adaptive Hardware and Systems
Systolic Architecture for Computational Fluid Dynamics on FPGAs
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Computer
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM - A Direct Path to Dependable Software
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Hi-index | 0.00 |
This paper demonstrates and evaluates the performance and the scalability of the systolic computational-memory array (SCMA) for stencil computation, which is a typical computing kernel of scientific simulation. We describe the basic architecture of th SCMA, and show the requirements and the design of SCMAs to scalably operate over multiple devices. We implement a prototype of the SCMA with three ALTERA Stratix III FPGAs, which form a 1--3 FPGA array by conecting three DE3 boards with different clock sources. The prototype SCMA demonstrates that the difference in operating clock frequency hardly influences the total execution cycles while it slightly causes stall cycles to sub-SCMAs on different FPGAs. With three banchmark programs of typical computing kernels based on the finite difference method, we show that the increased FPGAs provide higher performance proportional to the number of devices, resulting in almost linear speedup.