ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
A real-time procedural shading system for programmable graphics hardware
Proceedings of the 28th annual conference on Computer graphics and interactive techniques
Stream processor architecture
Imagine: Media Processing with Streams
IEEE Micro
A Stereo Machine for Video-Rate Dense Depth Mapping and Its New Applications
CVPR '96 Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96)
Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Proceedings of the 30th annual international symposium on Computer architecture
Programmable Stream Processors
Computer
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements
Proceedings of the 42nd annual Design Automation Conference
An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems
Proceedings of the 32nd annual international symposium on Computer Architecture
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Proceedings of the conference on Design, automation and test in Europe: Proceedings
SODA: A Low-power Architecture For Software Radio
Proceedings of the 33rd annual international symposium on Computer Architecture
Compiling for stream processing
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
The design space of data-parallel memory systems
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
ALP: Efficient support for all levels of parallelism for complex media applications
ACM Transactions on Architecture and Code Optimization (TACO)
Comparing memory systems for chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
An Integrated Memory Array Processor for Embedded Image Recognition Systems
IEEE Transactions on Computers
A high-end real-time digital film processing reconfigurable platform
EURASIP Journal on Embedded Systems
EURASIP Journal on Embedded Systems
INTACTE: an interconnect area, delay, and energy estimation tool for microarchitectural explorations
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Comparative evaluation of memory models for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
MPSoC Design Using Application-Specific Architecturally Visible Communication
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
AnySP: anytime anywhere anyway signal processing
Proceedings of the 36th annual international symposium on Computer architecture
Application development with the FlexWAFE real-time stream processing architecture for FPGAs
ACM Transactions on Embedded Computing Systems (TECS)
Using a configurable processor generator for computer architecture prototyping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Conservation cores: reducing the energy of mature computations
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An MPI-Stream Hybrid Programming Model for Computational Clusters
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
A configurable framework for stream programming exploration in baseband applications
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Throughput-Effective On-Chip Networks for Manycore Accelerators
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A stream architecture supporting multiple stream execution models
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Simulation-based evaluation of the Imagine stream processor with scientific programs
International Journal of High Performance Computing and Networking
Laplace transformation on the FT64 stream processor
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Architecture-based optimization for mapping scientific applications to imagine
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Implementation and optimization of sparse matrix-vector multiplication on imagine stream processor
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Designing on-chip networks for throughput accelerators
ACM Transactions on Architecture and Code Optimization (TACO)
This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine [Imagine: Media processing with streams] is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, and 1.58 Gbytes/s of memory system bandwidth, and can accept up to 2 million stream processor instructions per second from a host processor.

On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance. Of the remainder, on average 36.4% of time is lost due to load imbalance between arithmetic units in the VLIW clusters and limited instruction-level parallelism within kernel inner loops, 10.6% is due to kernel startup and shutdown overhead because of short stream lengths, 7.6% is due to memory stalls, and the rest is due to insufficient host processor bandwidth. Further analysis included in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
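The figures quoted in the abstract allow a quick sanity check. The sketch below is illustrative arithmetic only, not part of the paper: it assumes that all the quoted percentages are shares of total execution time (the abstract's wording leaves this open), and the variable names are hypothetical. Under that assumption it derives the unstated share attributed to host processor bandwidth, along with the energy efficiency and fraction of measured peak implied by the QR decomposition result.

```python
# Percentages quoted in the abstract for full applications.
# Assumption: every figure is a share of total execution time.
breakdown = {
    "useful arithmetic (fraction of peak)": 39.4,
    "load imbalance and limited ILP": 36.4,
    "kernel startup and shutdown": 10.6,
    "memory stalls": 7.6,
}

# "The rest" is attributed to insufficient host processor bandwidth.
host_bandwidth_share = round(100.0 - sum(breakdown.values()), 1)

# QR decomposition result and measured peak, as quoted in the abstract.
qr_gflops = 4.81
qr_watts = 7.42
peak_gflops = 7.96

gflops_per_watt = qr_gflops / qr_watts        # energy efficiency on QR
qr_fraction_of_peak = qr_gflops / peak_gflops  # share of measured peak

print(f"host bandwidth share: {host_bandwidth_share}%")
print(f"QR efficiency: {gflops_per_watt:.3f} GFLOPS/W")
print(f"QR fraction of measured peak: {qr_fraction_of_peak:.1%}")
```

Under the stated assumption, the four quoted shares sum to 94.0%, which leaves about 6% for host processor bandwidth.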