Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades

Authors:
Epifanio Gaona;Juan Fernández;Manuel E. Acacio
Affiliations:
Dept. de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain (all three authors)
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

The Cell Broadband Engine (Cell BE) is a heterogeneous multi-core processor specifically designed to exploit thread-level parallelism. Its memory model comprises a common shared main memory and eight small private local memories. Programming the Cell BE involves dealing with multiple threads and explicit data movement through DMA transfers, which makes the task very challenging. The situation gets even worse on dual Cell-based blades. In this context, fast and efficient collective primitives are indispensable for reducing complexity and optimizing performance. In this paper, we describe the design and implementation of three collective operations: barrier, broadcast, and reduce. Their design takes into account the architectural peculiarities and asymmetries of dual Cell-based blades, while their implementation requires minimal resources: a signal register and a buffer. Experimental results on a dual Cell-based blade with 16 SPEs show low latencies and high bandwidths: a synchronization latency of 637 ns, a broadcast bandwidth of 38.33 GB/s for 16 KB messages, and a reduce latency of 1535 ns with 32 floats.
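As an illustration of what a reduce collective computes, the following is a minimal Python sketch of a generic binary-tree reduction over 16 participants holding 32 floats each, the sizes quoted in the abstract. It is not the algorithm from the paper, which the abstract does not describe. The function name `tree_reduce`, the constants `NUM_SPES` and `VECTOR_LEN`, and the input values are all hypothetical, and ordinary Python lists stand in for SPE local stores and DMA transfers.

```python
from typing import Callable, List


def tree_reduce(buffers: List[List[float]],
                op: Callable[[float, float], float]) -> List[float]:
    """Combine per-participant vectors element-wise in ceil(log2(n)) rounds.

    Generic binary-tree reduction (an assumption for illustration, not the
    paper's method); the combined result ends up at rank 0.
    """
    if not buffers:
        raise ValueError("need at least one participant")
    # Work on copies so the callers' buffers are left untouched.
    work = [list(b) for b in buffers]
    n = len(work)
    stride = 1
    while stride < n:
        # In each round, every rank r that is a multiple of 2*stride
        # absorbs the partial result held by rank r + stride.
        for r in range(0, n - stride, 2 * stride):
            partner = r + stride
            work[r] = [op(a, b) for a, b in zip(work[r], work[partner])]
        stride *= 2
    return work[0]


# Hypothetical setup mirroring the sizes in the abstract:
# 16 participants, 32 floats each; rank k contributes the value k + 1.
NUM_SPES = 16
VECTOR_LEN = 32
inputs = [[float(rank + 1)] * VECTOR_LEN for rank in range(NUM_SPES)]
total = tree_reduce(inputs, lambda a, b: a + b)
```

With 16 participants the loop runs four rounds (strides 1, 2, 4, 8). A tree shape is a common choice for reductions because the number of sequential steps grows logarithmically with the participant count, but the design that is actually fastest on a dual Cell-based blade depends on the architectural asymmetries the paper studies.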