Efficient high performance collective communication for the cell blade

Authors:
Qasim Ali;Samuel P. Midkiff;Vijay S. Pai
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 23rd international conference on Supercomputing
Year:
2009

Citing 18
Cited 2

The Alliant FX/series: a language driven architecture of parallel processing of dusty deck Fortran

Volume I: Parallel architectures on PARLE: Parallel Architectures and Languages Europe
Fast barrier synchronization hardware

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
MagPIe: MPI's collective communication operations for clustered wide area systems

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Hardware Support for Collective Communication Operations

Proceedings of the First Heinz Nixdorf Symposium on Parallel Architectures and Their Efficient Use
An overview of the BlueGene/L Supercomputer

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine (Extended Abstract)

ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Bandwidth-Efficient Collective Communication for Clustered Wide Area Systems

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Power Efficient Processor Architecture and The Cell Processor

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Performance Analysis of MPI Collective Operations

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Optimization of MPI collective communication on BlueGene/L systems

Proceedings of the 19th annual international conference on Supercomputing
Data Transfers between Processes in an SMP System: Performance Study and Application to MPI

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Entering the petaflop era: the architecture and performance of Roadrunner

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Optimization of collective communication in intra-cell MPI

HiPC'07 Proceedings of the 14th international conference on High performance computing
Efficient implementation of allreduce on bluegene/l collective network

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
A synchronous mode MPI implementation on the cell BETM architecture

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications

Modeling advanced collective communication algorithms on cell-based systems

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents high-performance collective communication algorithms and implementations that exploit the unique architectural features of the Cell heterogeneous multicore processor. This paper specifically describes novel algorithms for the barrier, broadcast, reduce, all-reduce, and all-gather collective operations, and shows the efficiency of these by comparing them to the previous fastest known implementations of these operations targeting the Cell. The new implementations are faster than the published stateof-the-art, achieving up to 19.21 times the performance (95% reduction in latency) of the previous published collective communication work for the Cell [19, 25]. The results presented show performance both within a chip and across the two Cell chips on a Cell blade [10].