Evaluating the Imagine Stream Architecture

  • Authors:
  • Jung Ho Ahn; William J. Dally; Brucek Khailany; Ujval J. Kapasi; Abhishek Das

  • Affiliations:
  • Stanford University, CA; Stanford University, CA; Stanford University, CA; Stanford University, CA; Stanford University, CA

  • Venue:
  • Proceedings of the 31st Annual International Symposium on Computer Architecture
  • Year:
  • 2004

Abstract

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine [Imagine: Media processing with streams] is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, and 1.58 Gbytes/s of memory system bandwidth, and can accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance; of the remainder, on average 36.4% of time is lost due to load imbalance between arithmetic units in the VLIW clusters and limited instruction-level parallelism within kernel inner loops, 10.6% is due to kernel startup and shutdown overhead because of short stream lengths, 7.6% is due to memory stalls, and the rest is due to insufficient host processor bandwidth. Further analysis included in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
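
The figures quoted in the abstract can be cross-checked with a little arithmetic. The Python sketch below is illustrative only and is not taken from the paper. It assumes that the 36.4%, 10.6%, and 7.6% figures are shares of total application time (the abstract's wording is ambiguous on this point), so that the unstated host-bandwidth share is whatever is left over. It also assumes the 7.96 GFLOPS peak was measured at the same 200 MHz clock as the QR decomposition result. The function and variable names are hypothetical.

```python
# Figures quoted directly in the abstract.
CLUSTERS = 8              # SIMD clusters
UNITS_PER_CLUSTER = 6     # 6-wide VLIW per cluster
PEAK_GFLOPS = 7.96        # peak measured by the micro-benchmarks
QR_GFLOPS = 4.81          # sustained on QR decomposition
QR_WATTS = 7.42           # power dissipated while running QR

# ASSUMPTION: these are percentages of total application time,
# not percentages of the time left after the 39.4% useful share.
APP_TIME_SHARES = {
    "useful work at peak rate": 39.4,
    "load imbalance and limited ILP": 36.4,
    "kernel startup and shutdown": 10.6,
    "memory stalls": 7.6,
}


def derived_metrics():
    """Return quantities implied by the abstract's numbers."""
    named_total = sum(APP_TIME_SHARES.values())
    return {
        # 8 clusters x 6 units should match the stated 48 units.
        "arithmetic_units": CLUSTERS * UNITS_PER_CLUSTER,
        # Energy efficiency of the QR decomposition run.
        "qr_gflops_per_watt": QR_GFLOPS / QR_WATTS,
        # QR performance relative to peak (assumes the same clock).
        "qr_fraction_of_peak": QR_GFLOPS / PEAK_GFLOPS,
        # Leftover share, attributed to host processor bandwidth
        # under the assumption stated above.
        "host_bandwidth_share_pct": round(100.0 - named_total, 1),
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these assumptions, the QR decomposition run works out to roughly 0.65 GFLOPS per Watt and about 60% of the measured peak, and the share of time attributed to insufficient host processor bandwidth would be about 6%.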