Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Authors:
Darren J. Kerbyson;Michael Lang;Scott Pakin
Affiliations:
Fundamentals of Computational Sciences, Pacific Northwest National Laboratory, WA 99353, USA;Computer, Computational, and Statistical Sciences, Los Alamos National Laboratory, NM 87544, USA;Computer, Computational, and Statistical Sciences, Los Alamos National Laboratory, NM 87544, USA
Venue:
Parallel Computing
Year:
2011

Abstract

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems that use accelerators. Processor cores on the same socket can communicate at lower latency and higher bandwidth than cores on different sockets, whether those sockets are in the same node or in different nodes. A key challenge is to use this communication hierarchy efficiently and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications, data can be processed only after their upstream neighbors have been processed. Similar dependencies arise between processors: communication is required to pass boundary data downstream, and its cost is typically affected by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of the slower communications in the hierarchy, at the cost of additional steps in the parallel computation and greater use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in communication performance exists.
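
The upstream dependency that defines wave-front processing can be illustrated with a minimal Python sketch. This is not the authors' hierarchical scheme or their performance model; it only shows the basic dependency pattern on a small 2D grid, assuming each cell depends on its north and west neighbors. The function names `wavefront_steps` and `sweep`, and the toy update rule (a cell adds its upstream neighbors' values to its own), are hypothetical and chosen only for illustration.

```python
def wavefront_steps(rows, cols):
    """Group the cells (i, j) of a rows x cols grid into wave-front steps.

    Assumption for this sketch: cell (i, j) depends on its upstream
    neighbors (i-1, j) and (i, j-1). All cells with the same i + j then
    lie on one anti-diagonal and can be processed in the same step.
    """
    steps = [[] for _ in range(rows + cols - 1)]
    for i in range(rows):
        for j in range(cols):
            steps[i + j].append((i, j))
    return steps


def sweep(grid):
    """Toy sweep: each cell adds the results of its upstream neighbors."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for step in wavefront_steps(rows, cols):
        # Cells within one step are mutually independent, so a parallel
        # implementation could process them concurrently.
        for i, j in step:
            north = out[i - 1][j] if i > 0 else 0
            west = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + north + west
    return out


if __name__ == "__main__":
    for number, cells in enumerate(wavefront_steps(2, 3)):
        print(number, cells)
    print(sweep([[1, 1], [1, 1]]))
```

In this sketch a grid of `rows` by `cols` cells needs `rows + cols - 1` steps for the wave front to cross it. If each cell stood for a processor, every step boundary would involve passing boundary data downstream, which is where the cost of the communication channel enters.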