Exploring the VLSI Scalability of Stream Processors

Authors:
Brucek Khailany;William J. Dally;Scott Rixner;Ujval J. Kapasi;John D. Owens;Brian Towles
Venue:
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Year:
2003



Abstract

Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient up to tens of ALUs per cluster and up to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45-nanometer technology, sustaining over 300 GOPS on kernels and providing a 15.3x kernel speedup and an 8.0x application speedup over a 40-ALU stream processor, with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.
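
The figures quoted in the abstract can be related to one another with a little arithmetic. The sketch below is illustrative only: it uses the numbers stated above (128 clusters, 5 ALUs per cluster, a 40-ALU baseline, 15.3x and 8.0x speedups), while the function name `scaling_summary` and the "efficiency" ratio (speedup divided by the increase in ALU count) are assumptions introduced here, not metrics defined in the abstract.

```python
def scaling_summary(clusters=128, alus_per_cluster=5, baseline_alus=40,
                    kernel_speedup=15.3, app_speedup=8.0):
    """Relate the abstract's headline numbers (hypothetical helper)."""
    # Total ALU count of the scaled processor: clusters x ALUs per cluster.
    total_alus = clusters * alus_per_cluster
    # How many times more ALUs the scaled design has than the baseline.
    alu_ratio = total_alus / baseline_alus
    return {
        "total_alus": total_alus,
        "alu_ratio": alu_ratio,
        # Assumed metric: fraction of the ideal (linear) speedup achieved.
        "kernel_efficiency": kernel_speedup / alu_ratio,
        "app_efficiency": app_speedup / alu_ratio,
    }


if __name__ == "__main__":
    for name, value in scaling_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions, the 640-ALU design has 16 times the ALUs of the 40-ALU baseline, so the reported kernel speedup is close to linear in ALU count, while the application speedup is about half of what linear scaling would give.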