Merrimac: high-performance and highly-efficient scientific computing with streams

Authors:
William J. Dally;Mattan Erez
Affiliations:
Stanford University;Stanford University
Venue:
Merrimac: high-performance and highly-efficient scientific computing with streams
Year:
2007

Abstract

Advances in VLSI technology have made the raw ingredients for computation plentiful. Large numbers of fast functional units and large amounts of memory and bandwidth can be made efficient in terms of chip area, cost, and energy; however, high-performance computers realize only a small fraction of VLSI's potential. This dissertation describes the Merrimac streaming supercomputer architecture and system. Merrimac takes an integrated view of the applications, software system, compiler, and architecture. We show how this approach leads to gains of over an order of magnitude in performance per unit cost, unit power, and unit floor space for scientific applications, compared to common scientific computers designed around clusters of commodity general-purpose processors. The dissertation discusses Merrimac's stream architecture, the mapping of scientific codes to run effectively on the stream architecture, and system issues in the Merrimac supercomputer.

The stream architecture is designed to take advantage of the properties of modern semiconductor technology (very high bandwidth over short distances and very high transistor counts, but limited global on-chip and off-chip bandwidth) and to match them with the characteristics of scientific codes (large amounts of parallelism and access locality). Organizing the computation into streams and exploiting the resulting locality with a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative computations by an order of magnitude or more. Hence a processing node with a fixed memory bandwidth (which is expensive) can support an order of magnitude more arithmetic units (which are inexpensive). Because each node has much greater performance (128 double-precision GFLOP/s) than a conventional microprocessor, a streaming supercomputer can achieve a given level of performance with fewer nodes, reducing costs, simplifying system management, and increasing reliability.
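The bandwidth argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the dissertation: the function name `sustainable_flops` and every numeric value are hypothetical, chosen only to show that, at a fixed memory bandwidth, cutting the memory traffic per arithmetic operation by a factor of ten lets the same node keep roughly ten times as much arithmetic busy.

```python
def sustainable_flops(mem_bandwidth_words_per_s, words_per_flop):
    """Arithmetic rate (flop/s) that a fixed memory bandwidth can keep fed.

    Hypothetical helper for illustration; assumes memory bandwidth is the
    only bottleneck.
    """
    return mem_bandwidth_words_per_s / words_per_flop


# All values below are assumptions for illustration, not Merrimac figures.
MEM_BW = 2.5e9                  # off-chip memory bandwidth, words/s (assumed)
baseline_words_per_flop = 0.5   # memory words per flop without stream locality (assumed)

# The abstract claims a reduction in required memory bandwidth of an order
# of magnitude or more; model that as a factor of 10.
stream_words_per_flop = baseline_words_per_flop / 10

baseline = sustainable_flops(MEM_BW, baseline_words_per_flop)
stream = sustainable_flops(MEM_BW, stream_words_per_flop)

print(f"baseline: {baseline / 1e9:.1f} GFLOP/s")
print(f"stream:   {stream / 1e9:.1f} GFLOP/s")
print(f"ratio:    {stream / baseline:.0f}x")
```

The model ignores latency, on-chip bandwidth limits, and compute-bound phases, so it only captures the proportionality the abstract relies on, not any measured Merrimac result.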