Media applications such as image processing, signal processing, and graphics require tens to hundreds of billions of arithmetic operations per second of sustained performance to run at real-time rates, yet in many systems they also face tight power constraints. For this reason, these applications often use special-purpose (fixed-function) processors, such as graphics processors in desktop systems. These processors provide several orders of magnitude higher performance efficiency (performance per unit area and performance per unit power) than conventional programmable processors. In this dissertation, we present the VLSI implementation and evaluation of stream processors, which reduce this performance efficiency gap while retaining full programmability. Imagine is the first implementation of a stream processor. It contains 48 32-bit arithmetic units, supporting floating-point and integer data types, organized into eight SIMD arithmetic clusters. Imagine executes applications as stream programs, each consisting of a sequence of computation kernels operating on streams of data records. The prototype Imagine processor is a 21-million-transistor chip implemented in a 0.15 micron CMOS process. At 232 MHz, it achieves a peak performance of 9.3 GFLOPS while dissipating 6.4 W, on a die measuring 16 mm on a side. Furthermore, we extend these experimental results from Imagine to stream processors designed with more area- and energy-efficient custom design methodologies and to future VLSI technologies in which thousands of arithmetic units on a single chip will be feasible. Two techniques for increasing the number of arithmetic units in a stream processor are presented: intracluster scaling and intercluster scaling. These scaling techniques are shown to provide high performance efficiency up to tens of ALUs per cluster and up to hundreds of arithmetic clusters, demonstrating the viability of stream processing for many years to come.
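The two performance-efficiency metrics named in the abstract can be worked out from the prototype figures it quotes. The short Python sketch below is only an illustration of that arithmetic. It assumes the die is square (16 mm x 16 mm) and that the quoted peak and power figures refer to the same operating point. The variable names are made up for this example and do not come from the dissertation.

```python
# Figures quoted in the abstract for the Imagine prototype.
PEAK_GFLOPS = 9.3      # peak floating-point rate at 232 MHz
POWER_WATTS = 6.4      # power dissipation quoted alongside the peak rate
DIE_SIDE_MM = 16.0     # "16 mm on a side"; a square die is assumed here
CLOCK_MHZ = 232.0
NUM_ALUS = 48
NUM_CLUSTERS = 8

# Arithmetic units per SIMD cluster.
alus_per_cluster = NUM_ALUS // NUM_CLUSTERS

# Performance per unit power and per unit area, the two efficiency
# metrics defined in the abstract.
die_area_mm2 = DIE_SIDE_MM * DIE_SIDE_MM
gflops_per_watt = PEAK_GFLOPS / POWER_WATTS
gflops_per_mm2 = PEAK_GFLOPS / die_area_mm2

# Peak floating-point operations per clock cycle implied by the figures.
flops_per_cycle = (PEAK_GFLOPS * 1e3) / CLOCK_MHZ

print(f"ALUs per cluster:      {alus_per_cluster}")
print(f"Die area:              {die_area_mm2:.0f} mm^2")
print(f"Power efficiency:      {gflops_per_watt:.2f} GFLOPS/W")
print(f"Area efficiency:       {gflops_per_mm2:.4f} GFLOPS/mm^2")
print(f"Peak FLOPs per cycle:  {flops_per_cycle:.1f}")
```

Under these assumptions the prototype delivers roughly 1.45 GFLOPS per watt and about 0.036 GFLOPS per square millimetre, with six arithmetic units in each cluster. The implied peak of about 40 floating-point operations per cycle is lower than the 48 arithmetic units on the chip. The abstract does not break this down, so the sketch makes no attempt to explain the difference.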