High-performance RMA-based broadcast on the Intel SCC

Authors:
Darko Petrović;Omid Shahmirzadi;Thomas Ropars;André Schiper
Affiliations:
EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland
Venue:
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2012

Abstract

Many-core chips with more than 1000 cores are expected by the end of the decade. To overcome scalability issues related to cache coherence at such a scale, one of the main research directions is to leverage the message-passing programming model. The Intel Single-Chip Cloud Computer (SCC) is a prototype of a message-passing many-core chip. It offers the ability to move data between on-chip Message Passing Buffers (MPBs) using Remote Memory Access (RMA). The performance of message-passing applications is directly affected by the efficiency of collective operations, such as broadcast. In this paper, we study how to make use of the MPBs to implement an efficient broadcast algorithm for the SCC. We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Using a LogP-based model, we present an analytical evaluation that compares our algorithm to the state-of-the-art broadcast algorithms implemented for the SCC. As predicted by the model, experimental results show that OC-Bcast attains almost three times better throughput, and improves latency by at least 27%. Furthermore, the analytical evaluation highlights the benefits of our approach: OC-Bcast takes direct advantage of RMA, unlike the other considered broadcast algorithms, which are based on a higher-level send/receive interface. This leads us to the conclusion that RMA-based collective operations are needed to take full advantage of hardware features of future message-passing many-core architectures.
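
To make the idea of a pipelined k-ary broadcast tree concrete, the following Python sketch computes the tree shape and a rough step count. It is an illustration only, not the OC-Bcast implementation and not the paper's LogP-based model. The function names, the level-order rank numbering, and the cost assumption (each node forwards one chunk per step, and its k children fetch that chunk in parallel) are all assumptions made for this example. The values k = 7 and 10 chunks are arbitrary, and 48 ranks match the SCC's core count.

```python
def parent(rank, k):
    """Parent of `rank` in a k-ary tree rooted at rank 0 (level-order numbering)."""
    return None if rank == 0 else (rank - 1) // k


def children(rank, k, n):
    """Children of `rank` in a k-ary tree over ranks 0..n-1."""
    first = k * rank + 1
    return [c for c in range(first, first + k) if c < n]


def depth(n, k):
    """Number of edges from the root to the deepest rank (rank n-1)."""
    d, rank = 0, n - 1
    while rank != 0:
        rank = parent(rank, k)
        d += 1
    return d


def pipelined_steps(n, k, chunks):
    """Steps until the last chunk reaches the deepest rank.

    Simplifying assumption (hypothetical, for illustration): one chunk moves
    one tree level per step, and all k children of a node read it in parallel.
    """
    return depth(n, k) + chunks - 1


if __name__ == "__main__":
    n = 48  # number of ranks, matching the SCC's core count
    for k in (2, 7):
        print(k, depth(n, k), pipelined_steps(n, k, chunks=10))
```

Under these assumptions, a larger k gives a shallower tree, so fewer steps pass before the pipeline is full. The sketch ignores contention on a parent's buffer when many children read from it at once, which is the kind of effect an analytical model would need to capture.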