Fat-trees: universal networks for hardware-efficient supercomputing
IEEE Transactions on Computers
MPI-The Complete Reference, Volume 1: The MPI Core
MPI-The Complete Reference, Volume 1: The MPI Core
OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
Scalability Analysis of Multidimensional Wavefront Algorithms on Large-Scale SMP Clusters
FRONTIERS '99 Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel Computation
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
MPI Microtask for programming the cell broadband engineTM processor
IBM Systems Journal
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
CellSs: a programming model for the cell BE architecture
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
De Novo Ultrascale Atomistic Simulations On High-End Parallel Supercomputers
International Journal of High Performance Computing Applications
Cell/B.E. blades: building blocks for scalable, real-time, interactive, and digital media servers
IBM Journal of Research and Development
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
The PlayStation 3 for High-Performance Scientific Computing
Computing in Science and Engineering
Entering the petaflop era: the architecture and performance of Roadrunner
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Adapting a message-driven parallel application to GPU-accelerated clusters
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
0.374 Pflop/s trillion-particle kinetic modeling of laser plasma interaction on Roadrunner
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
International Journal of Parallel Programming
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Application profiling on Cell-based clusters
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
A synchronous mode MPI implementation on the cell BETM architecture
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Current technology trends favor hybrid architectures, typically with each node in a cluster containing both general-purpose and specialized accelerator processors. The typical model for programming such systems is host-centric: The general-purpose processor orchestrates the computation, offloading performancecritical work to the accelerator, and data are communicated only among general-purpose processors. In this paper, we propose a radically different hybrid-programming approach, which we call the reverse-acceleration model. In this model, the accelerators orchestrate the computation, offloading work that cannot be accelerated to the general-purpose processors. Data is communicated among accelerators, not among general-purpose processors. Our thesis is that the reverse-acceleration model simplifies porting codes to hybrid systems and facilitates performance optimization. We present a case study of a legacy neutron-transport code that we modified to use reverse acceleration and ran across the full 122,400 cores (general-purpose plus accelerator) of the Los Alamos National Laboratory Roadrunner supercomputer. Results indicate a substantial performance improvement over the unaccelerated version of the code.