The Merrimac supercomputer uses stream processors and a high-radix network to achieve high performance at low cost and low power. The stream architecture matches the capabilities of modern semiconductor technology with compute-intensive parallel applications. We present a detailed case study of porting the GROMACS molecular-dynamics force calculation to Merrimac. The characteristics of the architecture, which stress locality, parallelism, and the decoupling of memory operations from computation, allow compiler-optimized code to achieve high performance. The rich set of hardware memory operations and the ample computation bandwidth of the Merrimac processor present a wide range of algorithmic trade-offs and optimizations that may be generalized to several scientific computing domains. We use a cycle-accurate hardware simulator to analyze the performance bottlenecks of the various implementations and to measure application run-time. A comparison with the highly optimized GROMACS code, tuned for an Intel Pentium 4, confirms Merrimac's potential to deliver high performance.