Evaluating the Imagine Stream Architecture

Authors:
Jung Ho Ahn; William J. Dally; Brucek Khailany; Ujval J. Kapasi; Abhishek Das
Affiliations:
Stanford University, CA; Stanford University, CA; Stanford University, CA; Stanford University, CA; Stanford University, CA
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004

Citing 9
Cited 33

Communication scheduling

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
A real-time procedural shading system for programmable graphics hardware

Proceedings of the 28th annual conference on Computer graphics and interactive techniques
Stream processor architecture

Stream processor architecture
Imagine: Media Processing with Streams

IEEE Micro
A Stereo Machine for Video-Rate Dense Depth Mapping and Its New Applications

CVPR '96 Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96)
Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels

Proceedings of the 30th annual international symposium on Computer architecture
Programmable Stream Processors

Computer
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing

Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements

Proceedings of the 42nd annual Design Automation Conference
An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems

Proceedings of the 32nd annual international symposium on Computer Architecture
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A reconfigurable HW/SW platform for computation intensive high-resolution real-time digital film applications

Proceedings of the conference on Design, automation and test in Europe: Proceedings
SODA: A Low-power Architecture For Software Radio

Proceedings of the 33rd annual international symposium on Computer Architecture
Compiling for stream processing

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
The design space of data-parallel memory systems

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
ALP: Efficient support for all levels of parallelism for complex media applications

ACM Transactions on Architecture and Code Optimization (TACO)
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
SODA: A High-Performance DSP Architecture for Software-Defined Radio

IEEE Micro
An Integrated Memory Array Processor for Embedded Image Recognition Systems

IEEE Transactions on Computers
A high-end real-time digital film processing reconfigurable platform

EURASIP Journal on Embedded Systems
Automatic generation of spatial and temporal memory architectures for embedded video processing systems

EURASIP Journal on Embedded Systems
INTACTE: an interconnect area, delay, and energy estimation tool for microarchitectural explorations

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Comparative evaluation of memory models for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

The Journal of Supercomputing
MPSoC Design Using Application-Specific Architecturally Visible Communication

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
AnySP: anytime anywhere anyway signal processing

Proceedings of the 36th annual international symposium on Computer architecture
Application development with the FlexWAFE real-time stream processing architecture for FPGAs

ACM Transactions on Embedded Computing Systems (TECS)
Using a configurable processor generator for computer architecture prototyping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Conservation cores: reducing the energy of mature computations

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An MPI-Stream Hybrid Programming Model for Computational Clusters

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Mighty-morphing power-SIMD

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
A configurable framework for stream programming exploration in baseband applications

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A stream architecture supporting multiple stream execution models

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Simulation-based evaluation of the Imagine stream processor with scientific programs

International Journal of High Performance Computing and Networking
Laplace transformation on the FT64 stream processor

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Architecture-based optimization for mapping scientific applications to imagine

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Implementation and optimization of sparse matrix-vector multiplication on imagine stream processor

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)

Abstract

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine [Imagine: Media processing with streams] is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance. Of the remainder, on average 36.4% of time is lost due to load imbalance between arithmetic units in the VLIW clusters and limited instruction-level parallelism within kernel inner loops, 10.6% is due to kernel startup and shutdown overhead because of short stream lengths, 7.6% is due to memory stalls, and the rest is due to insufficient host processor bandwidth. Further analysis included in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
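
The abstract leaves two figures implicit: the share of time attributed to insufficient host processor bandwidth ("the rest") and the energy efficiency of the QR decomposition result. The sketch below derives both from the quoted numbers. It assumes that every percentage in the breakdown is a share of total execution time, which is one reading of the abstract and not something it states outright. The variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; all names here are illustrative only.
# ASSUMPTION: each percentage is a share of total execution time,
# so the categories plus "the rest" sum to 100%.
time_shares_pct = {
    "sustained arithmetic (fraction of peak)": 39.4,
    "load imbalance and limited ILP": 36.4,
    "kernel startup and shutdown": 10.6,
    "memory stalls": 7.6,
}


def remaining_share(shares):
    """Return the percentage not covered by the listed categories."""
    return round(100.0 - sum(shares.values()), 1)


def gflops_per_watt(gflops, watts):
    """Return sustained arithmetic throughput per watt dissipated."""
    return gflops / watts


# "The rest": time attributed to insufficient host processor bandwidth.
host_bandwidth_pct = remaining_share(time_shares_pct)

# QR decomposition at 200 MHz: 4.81 GFLOPS while dissipating 7.42 Watts.
qr_efficiency = gflops_per_watt(4.81, 7.42)

print(f"host bandwidth share: {host_bandwidth_pct}%")
print(f"QR efficiency: {qr_efficiency:.3f} GFLOPS/W")
```

Under that assumption, the host bandwidth share comes to 6.0% and the QR decomposition result to roughly 0.65 GFLOPS per watt.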