Recently, several factors have pushed high-performance processor architectures toward designs with multiple processing cores on a single chip, known as chip multiprocessors (CMPs). The Cell Broadband Engine (Cell BE) shows potential to deliver high performance to parallel applications such as MPI applications. An efficient implementation of collective communication operations is key to achieving high performance and scalability in parallel applications. In this work, we implement several collective communication operations on the Cell BE processor, namely broadcast, all-gather, and total-exchange, and investigate their performance in terms of latency and its components.
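For readers unfamiliar with the three collectives named above, the following minimal sketch models only their data-movement semantics on plain Python lists. It is not the Cell BE implementation evaluated in this work. The function names and the representation of ranks as list indices are assumptions made purely for illustration.

```python
# Semantic model only: each "rank" is an index into a list of buffers.
# All names here are hypothetical and not taken from the paper.

def broadcast(buffers, root):
    """After a broadcast, every rank holds a copy of the root's data."""
    return [list(buffers[root]) for _ in buffers]


def all_gather(blocks):
    """After an all-gather, every rank holds all ranks' blocks in rank order."""
    return [list(blocks) for _ in blocks]


def total_exchange(send):
    """After a total exchange (all-to-all personalized communication),
    rank dst holds send[src][dst] from every rank src, in rank order."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    print(broadcast([[1], [2], [3]], root=1))
    print(all_gather(["a", "b", "c"]))
    print(total_exchange([["a0", "a1"], ["b0", "b1"]]))
```

In this model, total exchange amounts to transposing the matrix of send blocks: each rank sends a distinct block to every other rank, whereas broadcast and all-gather replicate the same data across ranks.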