Feasibility study of MPI implementation on the heterogeneous multi-core Cell BE™ architecture

  • Authors:
  • Arun Kumar; Naresh Jayam; Ashok Srinivasan; Ganapathy Senthilkumar; Pallav K. Baruah; Shakti Kapoor; Murali Krishna; Raghunath Sarma

  • Affiliations:
  • Sri Sathya Sai University, Prasanthi Nilayam, India; Sri Sathya Sai University, Prasanthi Nilayam, India; Florida State University, Tallahassee, Florida; Sri Sathya Sai University, Prasanthi Nilayam, India; Sri Sathya Sai University, Prasanthi Nilayam, India; IBM, Austin, Texas; Sri Sathya Sai University, Prasanthi Nilayam, India; Sri Sathya Sai University, Prasanthi Nilayam, India

  • Venue:
  • Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
  • Year:
  • 2007

Abstract

The Cell Broadband Engine™ is a new heterogeneous multi-core processor from IBM, Sony, and Toshiba. It contains eight co-processors, called Synergistic Processing Elements (SPEs), each of which operates directly on its own 256 KB local store and also has access to a shared main memory of 512 MB to 2 GB. The combined peak speed of the SPEs is 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision. There is, therefore, much interest in using the Cell BE™ for high performance computing applications. However, the unconventional architecture of the SPEs, in particular their local stores, creates some programming challenges. We describe our implementation of certain core features of MPI, such as blocking point-to-point calls and collective communication calls. These can help meet the challenges by enabling a large class of MPI applications to be ported to the Cell BE™ processor. The implementation treats each SPE as a node running one MPI process. We store the application data in main memory so that applications are not limited by the local store size. The library manages the local store internally, so MPI calls do not expose it to the application. We have achieved bandwidth of up to 6.01 GB/s and latency as low as 0.41 µs on the ping-pong test. The contribution of this work lies in (i) demonstrating that the Cell BE™ has good potential for running intra-Cell BE™ MPI applications, (ii) enabling such applications to be ported to the Cell BE™ with minimal effort, and (iii) evaluating the performance impact of different design choices.
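
The peak-speed figures quoted in the abstract can be sanity-checked with a short calculation. The sketch below is illustrative only: the SPE count and the two peak figures come from the abstract, while the 3.2 GHz clock, the 4-wide single-precision SIMD width, and the counting of a fused multiply-add as two flops are assumptions about the hardware that the abstract does not state.

```python
import math

# Figures taken from the abstract.
NUM_SPES = 8
QUOTED_PEAK_SP_GFLOPS = 204.8   # single precision, all SPEs combined
QUOTED_PEAK_DP_GFLOPS = 14.64   # double precision, all SPEs combined

# Assumptions (not stated in the abstract).
CLOCK_GHZ = 3.2        # assumed SPE clock rate
SIMD_LANES_SP = 4      # assumed: one 128-bit register holds four 32-bit floats
FLOPS_PER_FMA = 2      # assumed: a fused multiply-add counts as two flops


def peak_gflops(num_spes, clock_ghz, lanes, flops_per_lane):
    """Peak rate if every SPE issues one SIMD operation per cycle."""
    return num_spes * clock_ghz * lanes * flops_per_lane


# Single-precision peak derived from the assumed hardware parameters.
total_sp = peak_gflops(NUM_SPES, CLOCK_GHZ, SIMD_LANES_SP, FLOPS_PER_FMA)
per_spe_sp = total_sp / NUM_SPES

# Per-SPE double-precision rate implied by the quoted combined figure.
per_spe_dp = QUOTED_PEAK_DP_GFLOPS / NUM_SPES

# How much slower double precision is than single precision at peak.
sp_to_dp_ratio = QUOTED_PEAK_SP_GFLOPS / QUOTED_PEAK_DP_GFLOPS

print(f"derived SP peak, all SPEs: {total_sp:.1f} Gflop/s")
print(f"derived SP peak, per SPE:  {per_spe_sp:.1f} Gflop/s")
print(f"implied DP peak, per SPE:  {per_spe_dp:.2f} Gflop/s")
print(f"SP/DP peak ratio:          {sp_to_dp_ratio:.1f}x")
```

Under these assumptions the derived single-precision total matches the quoted 204.8 Gflop/s. The large gap between the two precisions suggests that single-precision workloads stand to gain the most from the SPEs.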