Domain-specific programmable design of scalable streaming-array for power-efficient stencil computation

Authors:
Kentaro Sano;Satoru Yamamoto;Yoshiaki Hatsuda
Affiliations:
Sciences, Tohoku University;Sciences, Tohoku University;Kobo, Co., Ltd.
Venue:
ACM SIGARCH Computer Architecture News
Year:
2011

Citing 6
Cited 0

A Cellular Automata System with FPGA
FCCM '01 Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM - A Direct Path to Dependable Software

Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing

Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization
COMPSAC '09 Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01

Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth
FCCM '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines

Abstract

This paper presents the domain-specific programmable design of custom computing machines for high-performance stencil computation. Stencil computation is one of the typical kernels in scientific computations; however, its low operational intensity means that sustained performance is limited by memory bandwidth on recent microprocessors and GPUs. We previously proposed a scalable streaming-array (SSA) of processing elements, which scales almost linearly as FPGAs are added while the external-memory bandwidth stays constant. To facilitate custom computing and to use hardware resources efficiently for various and complex stencil computations, we design a programmable SSA with limited but necessary functionality. We show the design concept, the programmable structure, and the SIMD instruction set for the SSA. A prototype implementation with nine FPGAs demonstrates that our programmable design, with its many floating-point units, exploits hardware resources well, efficiently achieving 260 GFlop/s, which is 87.4% of the peak, at 1295 MFlop/sW.
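
To make the terms in the abstract concrete, the following is a minimal Python sketch of a generic stencil kernel together with some arithmetic on the reported figures. It is not the paper's kernel, instruction set, or performance model. The 4-neighbour averaging stencil, the function name `jacobi_step`, and the flop and byte counts per grid point are all assumptions chosen for illustration. The derived peak and power values assume that the efficiency figure is simply sustained performance divided by power.

```python
# Illustrative sketch only: a generic 2D 4-neighbour averaging stencil,
# not a kernel taken from the paper.

def jacobi_step(grid):
    """Return one sweep of the stencil; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out

# Assumed accounting (hypothetical): 3 additions + 1 multiplication per point,
# and, with perfect on-chip reuse, one 4-byte read and one 4-byte write per point.
FLOPS_PER_POINT = 4
BYTES_PER_POINT = 8
operational_intensity = FLOPS_PER_POINT / BYTES_PER_POINT  # flop per byte

# Figures reported in the abstract.
sustained_gflops = 260.0
peak_fraction = 0.874
mflops_per_watt = 1295.0
num_fpgas = 9

# Quantities implied by those figures (simple division, not reported values).
implied_peak_gflops = sustained_gflops / peak_fraction
implied_power_watts = sustained_gflops * 1000.0 / mflops_per_watt
gflops_per_fpga = sustained_gflops / num_fpgas

if __name__ == "__main__":
    demo = [[0.0, 1.0, 0.0],
            [2.0, 0.0, 3.0],
            [0.0, 4.0, 0.0]]
    print(jacobi_step(demo)[1][1])
    print(operational_intensity)
    print(round(implied_peak_gflops, 1), round(implied_power_watts, 1))
```

Under these assumptions the kernel performs only half a floating-point operation per byte moved, which illustrates why such computations tend to be memory-bound on conventional processors. Dividing the reported numbers suggests a peak of roughly 297.5 GFlop/s and a power draw of roughly 200 W across the nine FPGAs; these are back-of-the-envelope estimates, not values stated in the abstract.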