Tiled multi-core stream architecture

  • Authors:
  • Nan Wu;Qianming Yang;Mei Wen;Yi He;Ju Ren;Maolin Guan;Chunyuan Zhang

  • Affiliations:
  • Computer School, National University of Defense Technology, Chang Sha, Hu Nan, P.R. of China (all authors)

  • Venue:
  • Transactions on High-Performance Embedded Architectures and Compilers IV
  • Year:
  • 2011

Abstract

Conventional stream architectures focus on exploiting ILP and DLP in applications, although the stream model also exposes abundant TLP at kernel granularity. Meanwhile, as modern VLSI technology advances, increasing application demands and scalability requirements challenge conventional stream architectures. In this paper, we present a novel Tiled Multi-Core Stream Architecture called TiSA. TiSA introduces the tile, which consists of multiple stream cores, as a new category of architectural resource, and provides an on-chip network to support stream transfer among tiles. In TiSA, multiple levels of parallelism are exploited at different granularities of processing elements. Besides the hardware modules, this paper also discusses other key issues of the TiSA architecture, including the programming model, various execution patterns, and resource allocation. We then evaluate the hardware scalability of TiSA by scaling it from tens to thousands of ALUs and estimating its area and delay cost. We also evaluate the software scalability of TiSA by simulating 6 stream applications and comparing sustained performance with other stream processors and general-purpose processors, and across different TiSA configurations. A 256-ALU TiSA with 4 tiles and 4 stream cores per tile is shown to be feasible in 45 nanometer technology, sustaining 100~350 GFLOP/s on most stream benchmarks and providing ~10x speedup over a 16-ALU TiSA with a 5% degradation in area per ALU. The results show that TiSA is a VLSI- and performance-efficient architecture for the billion-transistor era.
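
The configuration figures quoted in the abstract imply some simple arithmetic, which the following minimal Python sketch works through. The class name `TisaConfig`, its field names, and the derived quantities are hypothetical and introduced only for illustration. The sketch assumes that ALUs are divided evenly among stream cores, that "~10x" can be treated as exactly 10, and that "5% degradation in area per ALU" means each ALU costs about 1.05 times the area it does in the 16-ALU design. None of these assumptions is confirmed by the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TisaConfig:
    """Hypothetical description of a TiSA configuration (names are illustrative)."""
    tiles: int
    cores_per_tile: int
    alus_per_core: int

    @property
    def total_alus(self) -> int:
        # Total ALUs = tiles x stream cores per tile x ALUs per stream core.
        return self.tiles * self.cores_per_tile * self.alus_per_core


# The abstract gives 256 ALUs, 4 tiles, and 4 stream cores per tile.
# Assumption: ALUs are split evenly, so each stream core holds 256 / (4 * 4) ALUs.
big = TisaConfig(tiles=4, cores_per_tile=4, alus_per_core=256 // (4 * 4))

baseline_alus = 16        # the 16-ALU TiSA used as the comparison point
reported_speedup = 10.0   # "~10x" in the abstract, treated as exactly 10 here

# Ratio of ALU counts between the two configurations.
alu_ratio = big.total_alus / baseline_alus

# Fraction of ideal linear scaling that the reported speedup corresponds to.
scaling_efficiency = reported_speedup / alu_ratio

# Assumption: a 5% degradation means 1.05x area per ALU in the larger design.
area_ratio = alu_ratio * 1.05

print(f"ALUs per stream core: {big.alus_per_core}")
print(f"ALU ratio: {alu_ratio:.1f}x")
print(f"Scaling efficiency: {scaling_efficiency:.3f}")
print(f"Estimated total area ratio: {area_ratio:.1f}x")
```

Under these assumptions, each stream core holds 16 ALUs, and a 10x speedup from 16x the ALUs corresponds to roughly 62% of ideal linear scaling.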