A Synchronous Mode MPI Implementation on the Cell BE™ Architecture

  • Authors:
  • Murali Krishna; Arun Kumar; Naresh Jayam; Ganapathy Senthilkumar; Pallav K. Baruah; Raghunath Sharma; Shakti Kapoor; Ashok Srinivasan

  • Affiliations:
  • Dept. of Mathematics and Computer Science, Sri Sathya Sai University, Prashanthi Nilayam, India (Krishna, Kumar, Jayam, Senthilkumar, Baruah, Sharma); IBM, Austin (Kapoor); Dept. of Computer Science, Florida State University (Srinivasan)

  • Venue:
  • ISPA '07: Proceedings of the 5th International Conference on Parallel and Distributed Processing and Applications
  • Year:
  • 2007

Abstract

The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multicore processor, with the bulk of the computational workload meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, and all the SPEs also have access to a shared 512 MB to 2 GB main memory through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates some programming challenges. To help address these challenges, we have implemented the core features of MPI for the Cell. This implementation views each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, using the rendezvous protocol, which makes MPI communication for long messages efficient. We further present experimental results on Cell hardware, where the implementation achieves good performance: throughput of up to 6.01 GB/s and latency as low as 0.65 µs on the ping-pong test. This demonstrates that it is possible to implement MPI calls efficiently even on the simple SPE cores.
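
The synchronous-mode semantics discussed in the abstract can be exercised with a standard ping-pong benchmark built on MPI_Ssend, which completes only once the matching receive has begun, mirroring the rendezvous handshake used for long messages. The sketch below is a generic illustration against any MPI library, not the authors' Cell-specific implementation; the message size and iteration count are arbitrary assumptions.

```c
/* Generic ping-pong sketch using MPI synchronous-mode sends (MPI_Ssend).
 * Illustrative only: message size and iteration count are assumptions,
 * and this is not the paper's Cell/SPE-specific code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 1000;
    const int n = 1 << 20;              /* 1 MB messages (assumed size) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(n);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            /* MPI_Ssend completes only after the receiver has started
             * receiving: the rendezvous-style handshake for long messages. */
            MPI_Ssend(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Ssend(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* One-way time per message and aggregate bandwidth. */
        printf("avg one-way time: %g us, bandwidth: %g GB/s\n",
               (t1 - t0) / (2.0 * iters) * 1e6,
               (double)n * 2.0 * iters / (t1 - t0) / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

In the paper's implementation, the rendezvous handshake is presumably carried out over DMA between the SPE local stores and main memory; the sketch above only mirrors the application-level MPI semantics that the ping-pong test measures.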