Analysis and Performance Results of a Molecular Modeling Application on Merrimac

Authors:
Mattan Erez;Jung Ho Ahn;Ankit Garg;William J. Dally;Eric Darve
Affiliations:
Stanford University;Stanford University;Stanford University;Stanford University;Stanford University
Venue:
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Year:
2004

Abstract

The Merrimac supercomputer uses stream processors and a high-radix network to achieve high performance at low cost and low power. The stream architecture matches the capabilities of modern semiconductor technology with compute-intensive parallel applications. We present a detailed case study of porting the GROMACS molecular-dynamics force calculation to Merrimac. The characteristics of the architecture, which stress locality, parallelism, and decoupling of memory operations and computation, allow for high performance of compiler-optimized code. The rich set of hardware memory operations and the ample computation bandwidth of the Merrimac processor present a wide range of algorithmic trade-offs and optimizations, which may be generalized to several scientific computing domains. We use a cycle-accurate hardware simulator to analyze the performance bottlenecks of the various implementations and to measure application run-time. A comparison with the highly optimized GROMACS code, tuned for an Intel Pentium 4, confirms Merrimac's potential to deliver high performance.