A Synchronous Mode MPI Implementation on the Cell BE™ Architecture

  • Authors:
  • Murali Krishna; Arun Kumar; Naresh Jayam; Ganapathy Senthilkumar; Pallav K. Baruah; Raghunath Sharma; Shakti Kapoor; Ashok Srinivasan

  • Affiliations:
  • Dept. of Mathematics and Computer Science, Sri Sathya Sai University, Prashanthi Nilayam, India (Krishna, Kumar, Jayam, Senthilkumar, Baruah, Sharma); IBM, Austin (Kapoor); Dept. of Computer Science, Florida State University (Srinivasan)

  • Venue:
  • ISPA '07: Proceedings of the 5th International Conference on Parallel and Distributed Processing and Applications
  • Year:
  • 2007

Abstract

The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multicore processor, with the bulk of the computational workload meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, and all the SPEs also have access to a shared 512 MB to 2 GB main memory through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates some programming challenges. To help address these challenges, we have implemented the core features of MPI for the Cell. This implementation views each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, using the rendezvous protocol, which makes MPI communication for long messages efficient. We further present experimental results on Cell hardware, where the implementation achieves good performance: throughput of up to 6.01 GB/s and latency as low as 0.65 µs on the ping-pong test. This demonstrates that it is possible to implement MPI calls efficiently even on the simple SPE cores.
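
The synchronous-mode semantics discussed in the abstract can be exercised with a standard ping-pong benchmark built on MPI_Ssend, which completes only once the matching receive has begun, mirroring the rendezvous handshake used for long messages. The sketch below is a generic illustration against any MPI library, not the authors' Cell-specific implementation; the message size and iteration count are arbitrary assumptions.

```c
/* Generic ping-pong sketch using MPI synchronous-mode sends (MPI_Ssend).
 * Illustrative only: message size and iteration count are assumptions,
 * and this is not the paper's Cell/SPE-specific code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 1000;
    const int n = 1 << 20;              /* 1 MB messages (assumed size) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(n);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            /* MPI_Ssend completes only after the receiver has started
             * receiving: the rendezvous-style handshake for long messages. */
            MPI_Ssend(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Ssend(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* One-way time per message and aggregate bandwidth. */
        printf("avg one-way time: %g us, bandwidth: %g GB/s\n",
               (t1 - t0) / (2.0 * iters) * 1e6,
               (double)n * 2.0 * iters / (t1 - t0) / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

In the paper's implementation, the rendezvous handshake is presumably carried out over DMA between the SPE local stores and main memory; the sketch above only mirrors the application-level MPI semantics that the ping-pong test measures.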