A synchronous mode MPI implementation on the Cell BE™ architecture
ISPA '07: Proceedings of the 5th International Conference on Parallel and Distributed Processing and Applications
The Cell Broadband Engine™ is a new heterogeneous multi-core processor from IBM, Sony, and Toshiba. It contains eight co-processors, called Synergistic Processing Elements (SPEs), which operate directly on distinct 256 KB local stores and also have access to a shared main memory of 512 MB to 2 GB. The combined peak speed of the SPEs is 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision. There is, therefore, much interest in using the Cell BE™ for high performance computing applications. However, the unconventional architecture of the SPEs, in particular their small local stores, poses some programming challenges. We describe our implementation of certain core features of MPI, such as blocking point-to-point calls and collective communication calls, which helps meet these challenges by enabling a large class of MPI applications to be ported to the Cell BE™ processor. This implementation treats each SPE as a node running an MPI process. We store the application data in main memory in order to avoid being limited by the local store size; the local store is abstracted within the library and thus hidden from the application with respect to MPI calls. We have achieved bandwidth up to 6.01 GB/s and latency as low as 0.41 μs on the ping-pong test. The contribution of this work lies in (i) demonstrating that the Cell BE™ has good potential for running intra-Cell BE™ MPI applications, (ii) enabling such applications to be ported to the Cell BE™ with minimal effort, and (iii) evaluating the performance impact of different design choices.
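For readers unfamiliar with the benchmark the abstract refers to, the sketch below shows a minimal MPI ping-pong microbenchmark of the kind typically used to report such bandwidth and latency numbers. It uses only the standard blocking point-to-point calls (MPI_Send/MPI_Recv) that the paper's library implements; the message size, iteration count, and output format are illustrative assumptions, not details taken from the paper.

/*
 * Minimal ping-pong sketch between ranks 0 and 1, using standard
 * blocking MPI calls. Message size and iteration count are assumed.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 1000;
    const int len = 1 << 20;                  /* 1 MB messages, assumed */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                      /* rank 0: send, then receive */
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* rank 1: receive, then send */
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / iters;       /* round-trip time per iteration */
        printf("one-way latency: %.3f us, bandwidth: %.2f GB/s\n",
               rtt / 2 * 1e6,                 /* half the round trip */
               2.0 * len / rtt / 1e9);        /* 2*len bytes move per round trip */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Under the paper's design, each of the two ranks here would map to an SPE, with the message buffers resident in main memory rather than in the 256 KB local stores; latency figures like the reported 0.41 μs are usually measured with very small messages rather than the large size used in this sketch.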