Fat-trees: universal networks for hardware-efficient supercomputing
IEEE Transactions on Computers
Performance Analysis of k-ary n-cube Interconnection Networks
IEEE Transactions on Computers
A performance comparison of four supercomputers
Communications of the ACM
Digital systems engineering
Communications of the ACM - Special issue on computer architecture
Imagine: Media Processing with Streams
IEEE Micro
Scalable Opto-Electronic Network (SOENet)
HOTI '02 Proceedings of the 10th Symposium on High Performance Interconnects HOT Interconnects
Exploring the VLSI Scalability of Stream Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Programmable Stream Processors
Computer
Evaluating the Imagine Stream Architecture
Proceedings of the 31st annual international symposium on Computer architecture
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
High-Throughput CORDIC-Based Geometry Operations for 3D Computer Graphics
IEEE Transactions on Computers
Analysis and Performance Results of a Molecular Modeling Application on Merrimac
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Bandwidth Management with a Reconfigurable Data Cache
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
Stream Programming on General-Purpose Processors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
ClawHMMER: A Streaming HMMer-Search Implementatio
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Fault Tolerance Techniques for the Merrimac Streaming Supercomputer
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Data and Computation Transformations for Brook Streaming Applications on Multiprocessors
Proceedings of the International Symposium on Code Generation and Optimization
A versatile stereo implementation on commodity graphics hardware
Real-Time Imaging
Compiling for stream processing
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Graphical Models - Special issue on PG2004
The potential energy efficiency of vector acceleration
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The design space of data-parallel memory systems
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
ALP: Efficient support for all levels of parallelism for complex media applications
ACM Transactions on Architecture and Code Optimization (TACO)
A 64-bit stream processor architecture for scientific applications
Proceedings of the 34th annual international symposium on Computer architecture
Comparing memory systems for chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Executing irregular scientific applications on stream architectures
Proceedings of the 21st annual international conference on Supercomputing
Tradeoff between data-, instruction-, and thread-level parallelism in stream processors
Proceedings of the 21st annual international conference on Supercomputing
Journal of VLSI Signal Processing Systems
Application driven embedded system design: a face recognition case study
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Mapping streaming architectures on reconfigurable platforms
ACM SIGARCH Computer Architecture News - Special issue on the 2006 reconfigurable and adaptive architecture workshop
Biosequence Similarity Search on the Mercury System
Journal of VLSI Signal Processing Systems
Streaming Algorithms for Biological Sequence Alignment on GPUs
IEEE Transactions on Parallel and Distributed Systems
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Load scheduling: reducing pressure on distributed register files for free
Proceedings of the 2008 Asia and South Pacific Design Automation Conference
Optimizing scientific application loops on stream processors
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
A Performance Model of Dense Matrix Operations on Many-Core Architectures
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Application-specific Processor Architecture: Then and Now
Journal of Signal Processing Systems
Exploiting loop-dependent stream reuse for stream processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A tuning framework for software-managed memory hierarchies
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Comparative evaluation of memory models for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Using GPUs to improve multigrid solver performance on a cluster
International Journal of Computational Science and Engineering
GRAMPS: A programming model for graphics pipelines
ACM Transactions on Graphics (TOG)
Certified Reasoning in Memory Hierarchies
APLAS '08 Proceedings of the 6th Asian Symposium on Programming Languages and Systems
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
MPSoC Design Using Application-Specific Architecturally Visible Communication
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Streaming implementation of a sequential decompression algorithm on an FPGA
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
StreamRay: a stream filtering architecture for coherent ray tracing
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
ARC '09 Proceedings of the 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications
A multi-streaming SIMD architecture for multimedia applications
Proceedings of the 6th ACM conference on Computing frontiers
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
ACM Transactions on Architecture and Code Optimization (TACO)
Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Nodal discontinuous Galerkin methods on graphics processors
Journal of Computational Physics
PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Increasing memory miss tolerance for SIMD cores
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Conservation cores: reducing the energy of mature computations
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An analytical model to exploit memory task scheduling
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
The Journal of Supercomputing
ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
Stream image processing on a dual-core embedded system
SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
FT64: scientific computing with streams
HiPC'07 Proceedings of the 14th international conference on High performance computing
Implementation and evaluation of Jacobi iteration on the imagine stream processor
HiPC'07 Proceedings of the 14th international conference on High performance computing
Implementation and optimization of dense LU ecomposition on the stream processor
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Implementing and optimizing a data-intensive hydrodynamics application on the stream processor
ICCSA'07 Proceedings of the 2007 international conference on Computational science and its applications - Volume Part III
Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)
Understanding throughput-oriented architectures
Communications of the ACM
A multi-streaming SIMD multimedia computing engine
Microprocessors & Microsystems
Reuse-aware modulo scheduling for stream processors
Proceedings of the Conference on Design, Automation and Test in Europe
Improving scratchpad allocation with demand-driven data tiling
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
An evaluation of OpenMP on current and emerging multithreaded/multicore processors
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Throughput-Effective On-Chip Networks for Manycore Accelerators
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Memory Latency Reduction via Thread Throttling
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Streaming Data Movement for Real-Time Image Analysis
Journal of Signal Processing Systems
Landing stencil code on Godson-T
Journal of Computer Science and Technology
Loop fusion and reordering for register file optimization on stream processors
Proceedings of the 2011 ACM Symposium on Applied Computing
Scalable heterogeneous parallelism for atmospheric modeling and simulation
The Journal of Supercomputing
Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters
The Journal of Supercomputing
Flow: A Stream Processing System Simulator
PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
Proceedings of the 8th ACM International Conference on Computing Frontiers
Register allocation on stream processor with local register file
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors
The Journal of Supercomputing
ACM Transactions on Architecture and Code Optimization (TACO)
A compile-time managed multi-level register file hierarchy
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
ACM Transactions on Computer Systems (TOCS)
Tiled multi-core stream architecture
Transactions on High-Performance Embedded Architectures and Compilers IV
Simulation-based evaluation of the Imagine stream processor with scientific programs
International Journal of High Performance Computing and Networking
Adaptive task duplication using on-line bottleneck detection for streaming applications
Proceedings of the 9th conference on Computing Frontiers
Loop fusion and reordering for register file optimization on stream processors
Journal of Systems and Software
Characterizing and improving the use of demand-fetched caches in GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Automatic generation of software pipelines for heterogeneous parallel systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scheduling streaming applications on a complex multicore platform
Concurrency and Computation: Practice & Experience
Laplace transformation on the FT64 stream processor
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Architecture-based optimization for mapping scientific applications to imagine
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Location-aware cache management for many-core processors with deep cache hierarchy
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Designing on-chip networks for throughput accelerators
ACM Transactions on Architecture and Code Optimization (TACO)
Scalability study of molecular dynamics simulation on Godson-T many-core architecture
Journal of Parallel and Distributed Computing
Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.02 |
Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes) resulting in greater reliability, and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer and present the results of some initial application experiments on this architecture.