Merrimac: Supercomputing with Streams

Authors:
William J. Dally;Francois Labonte;Abhishek Das;Patrick Hanrahan;Jung-Ho Ahn;Jayanth Gummaraju;Mattan Erez;Nuwan Jayasena;Ian Buck;Timothy J. Knight;Ujval J. Kapasi
Venue:
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Year:
2003

Citing 9
Cited 99

Fat-trees: universal networks for hardware-efficient supercomputing

IEEE Transactions on Computers
Performance Analysis of k-ary n-cube Interconnection Networks

IEEE Transactions on Computers
A performance comparison of four supercomputers

Communications of the ACM
Digital systems engineering

Digital systems engineering
The CRAY-1 computer system

Communications of the ACM - Special issue on computer architecture
Imagine: Media Processing with Streams

IEEE Micro
Scalable Opto-Electronic Network (SOENet)

HOTI '02 Proceedings of the 10th Symposium on High Performance Interconnects HOT Interconnects
Exploring the VLSI Scalability of Stream Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Programmable Stream Processors

Computer

Evaluating the Imagine Stream Architecture

Proceedings of the 31st annual international symposium on Computer architecture
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
High-Throughput CORDIC-Based Geometry Operations for 3D Computer Graphics

IEEE Transactions on Computers
Analysis and Performance Results of a Molecular Modeling Application on Merrimac

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Bandwidth Management with a Reconfigurable Data Cache

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
ClawHMMER: A Streaming HMMer-Search Implementation

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Fault Tolerance Techniques for the Merrimac Streaming Supercomputer

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Proceedings of the International Symposium on Code Generation and Optimization
A versatile stereo implementation on commodity graphics hardware

Real-Time Imaging
Compiling for stream processing

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Vertex-transformation streams

Graphical Models - Special issue on PG2004
The potential energy efficiency of vector acceleration

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The design space of data-parallel memory systems

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
ALP: Efficient support for all levels of parallelism for complex media applications

ACM Transactions on Architecture and Code Optimization (TACO)
A 64-bit stream processor architecture for scientific applications

Proceedings of the 34th annual international symposium on Computer architecture
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Executing irregular scientific applications on stream architectures

Proceedings of the 21st annual international conference on Supercomputing
Tradeoff between data-, instruction-, and thread-level parallelism in stream processors

Proceedings of the 21st annual international conference on Supercomputing
Performance Analysis of General-Purpose Computation on Commodity Graphics Hardware: A Case Study Using Bioinformatics

Journal of VLSI Signal Processing Systems
Application driven embedded system design: a face recognition case study

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Mapping streaming architectures on reconfigurable platforms

ACM SIGARCH Computer Architecture News - Special issue on the 2006 reconfigurable and adaptive architecture workshop
Biosequence Similarity Search on the Mercury System

Journal of VLSI Signal Processing Systems
Streaming Algorithms for Biological Sequence Alignment on GPUs

IEEE Transactions on Parallel and Distributed Systems
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Load scheduling: reducing pressure on distributed register files for free

Proceedings of the 2008 Asia and South Pacific Design Automation Conference
Optimizing scientific application loops on stream processors

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Recognition and Optimization of Loop-Carried Stream Reusing of Scientific Computing Applications on the Stream Processor

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
A Performance Model of Dense Matrix Operations on Many-Core Architectures

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Application-specific Processor Architecture: Then and Now

Journal of Signal Processing Systems
Exploiting loop-dependent stream reuse for stream processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A tuning framework for software-managed memory hierarchies

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Comparative evaluation of memory models for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Using GPUs to improve multigrid solver performance on a cluster

International Journal of Computational Science and Engineering
GRAMPS: A programming model for graphics pipelines

ACM Transactions on Graphics (TOG)
Certified Reasoning in Memory Hierarchies

APLAS '08 Proceedings of the 6th Asian Symposium on Programming Languages and Systems
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

The Journal of Supercomputing
Comparability graph coloring for optimizing utilization of stream register files in stream processors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
MPSoC Design Using Application-Specific Architecturally Visible Communication

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Streaming implementation of a sequential decompression algorithm on an FPGA

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
StreamRay: a stream filtering architecture for coherent ray tracing

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Optimizing Memory Access Latencies on a Reconfigurable Multimedia Accelerator: A Case of a Turbo Product Codes Decoder

ARC '09 Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications
A multi-streaming SIMD architecture for multimedia applications

Proceedings of the 6th ACM conference on Computing frontiers
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

ACM Transactions on Architecture and Code Optimization (TACO)
Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Nodal discontinuous Galerkin methods on graphics processors

Journal of Computational Physics
Design and implementation of stream processing system and library for CELL broadband engine processors

PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Conservation cores: reducing the energy of mature computations

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An analytical model to exploit memory task scheduling

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
An enhancer of memory and network for applications with large-capacity data and non-continuous data accessing

The Journal of Supercomputing
Optimizing stream organization to improve the performance of scientific computing applications on the stream processor

ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
Stream image processing on a dual-core embedded system

SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
FT64: scientific computing with streams

HiPC'07 Proceedings of the 14th international conference on High performance computing
Implementation and evaluation of Jacobi iteration on the imagine stream processor

HiPC'07 Proceedings of the 14th international conference on High performance computing
Implementation and optimization of dense LU decomposition on the stream processor

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Implementing and optimizing a data-intensive hydrodynamics application on the stream processor

ICCSA'07 Proceedings of the 2007 international conference on Computational science and its applications - Volume Part III
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
Understanding throughput-oriented architectures

Communications of the ACM
A multi-streaming SIMD multimedia computing engine

Microprocessors & Microsystems
Reuse-aware modulo scheduling for stream processors

Proceedings of the Conference on Design, Automation and Test in Europe
Improving scratchpad allocation with demand-driven data tiling

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
An evaluation of OpenMP on current and emerging multithreaded/multicore processors

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Memory Latency Reduction via Thread Throttling

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Streaming Data Movement for Real-Time Image Analysis

Journal of Signal Processing Systems
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Loop fusion and reordering for register file optimization on stream processors

Proceedings of the 2011 ACM Symposium on Applied Computing
Scalable heterogeneous parallelism for atmospheric modeling and simulation

The Journal of Supercomputing
Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters

The Journal of Supercomputing
Flow: A Stream Processing System Simulator

PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor

Proceedings of the 8th ACM International Conference on Computing Frontiers
Register allocation on stream processor with local register file

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Parallel Computing
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

The Journal of Supercomputing
Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream Processors

ACM Transactions on Architecture and Code Optimization (TACO)
A compile-time managed multi-level register file hierarchy

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
Tiled multi-core stream architecture

Transactions on High-Performance Embedded Architectures and Compilers IV
Simulation-based evaluation of the Imagine stream processor with scientific programs

International Journal of High Performance Computing and Networking
Adaptive task duplication using on-line bottleneck detection for streaming applications

Proceedings of the 9th conference on Computing Frontiers
Loop fusion and reordering for register file optimization on stream processors

Journal of Systems and Software
Characterizing and improving the use of demand-fetched caches in GPUs

Proceedings of the 26th ACM international conference on Supercomputing
Towards flexible exascale stream processing system simulation

Simulation
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scheduling streaming applications on a complex multicore platform

Concurrency and Computation: Practice & Experience
Laplace transformation on the FT64 stream processor

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Architecture-based optimization for mapping scientific applications to imagine

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Location-aware cache management for many-core processors with deep cache hierarchy

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Scalability study of molecular dynamics simulation on Godson-T many-core architecture

Journal of Parallel and Distributed Computing
Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.02

Abstract

Merrimac uses a stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K, 2-TFLOPS workstation to a $20M, 2-PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
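
The scaling figures quoted in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal prefixes (1 PFLOPS = 10^15 FLOPS), and the helper names `per_node_gflops` and `dollars_per_gflops` are hypothetical, not taken from the paper. The per-node rate it prints is implied by the abstract's numbers rather than stated there.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumption: decimal prefixes, i.e. 1 PFLOPS = 1e15 FLOPS.
GFLOPS = 1e9
TFLOPS = 1e12
PFLOPS = 1e15


def per_node_gflops(system_flops: float, nodes: int) -> float:
    """Peak rate each node must supply, in GFLOPS (hypothetical helper)."""
    return system_flops / nodes / GFLOPS


def dollars_per_gflops(cost_dollars: float, system_flops: float) -> float:
    """System cost divided by peak rate, in dollars per GFLOPS (hypothetical helper)."""
    return cost_dollars / (system_flops / GFLOPS)


# "a 1-PFLOPS machine ... with just 8,192 nodes"
node_rate = per_node_gflops(1 * PFLOPS, 8192)

# "$20K 2 TFLOPS workstation" and "$20M 2 PFLOPS supercomputer"
workstation = dollars_per_gflops(20_000, 2 * TFLOPS)
supercomputer = dollars_per_gflops(20_000_000, 2 * PFLOPS)

print(f"implied per-node rate: {node_rate:.1f} GFLOPS")
print(f"workstation:   ${workstation:.2f} per GFLOPS")
print(f"supercomputer: ${supercomputer:.2f} per GFLOPS")
```

Under these assumptions both configurations come out at the same cost per GFLOPS, which matches the abstract's claim that the design scales from workstation to supercomputer.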