rMPI: message passing on multicore processors with on-chip interconnect

Authors:
James Psota; Anant Agarwal
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA; Massachusetts Institute of Technology, Cambridge, MA
Venue:
HiPEAC '08: Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers
Year:
2008



Abstract

With multicore processors becoming the standard architecture, programmers face the challenge of developing applications that capitalize on multicore's advantages. This paper presents rMPI, which leverages the on-chip networks of multicore processors to build a powerful abstraction with which many programmers are familiar: the MPI programming interface. To our knowledge, rMPI is the first MPI implementation for multicore processors that have on-chip networks. This study uses the MIT Raw processor as an experimentation and validation vehicle, although the findings presented apply to multicore processors with on-chip networks in general. Likewise, this study uses the MPI API as a general interface that allows parallel tasks to communicate, but the results shown in this paper are generally applicable to message passing communication. Overall, rMPI's design constitutes the marriage of message passing communication and on-chip networks, allowing programmers to apply a well-understood programming model to a high-performance multicore processor architecture. This work assesses the applicability of the MPI API to multicore processors with on-chip interconnect and carefully analyzes the overheads associated with common MPI operations. This paper contrasts MPI with the lower-overhead network interface abstractions that the on-chip networks provide. The evaluation also compares rMPI to hand-coded applications running directly on one of the processor's low-level on-chip networks, as well as to a commercial-quality MPI implementation running on a cluster of Ethernet-connected workstations. Depending on the application, rMPI achieves speedups of 4x to 15x on 16 processor cores relative to one core, equaling or exceeding the performance scalability of the MPI cluster system.
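The reported speedups can be restated as parallel efficiency, the fraction of ideal linear speedup achieved on p cores. Here T_1 and T_p denote runtime on one core and on p cores; the efficiency figures below are simple arithmetic derived from the 4x and 15x endpoints for p = 16, not numbers reported by the authors.

```latex
\begin{aligned}
S(p) &= \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} \\
E_{\min}(16) &= \frac{4}{16} = 0.25, \qquad E_{\max}(16) = \frac{15}{16} \approx 0.94
\end{aligned}
```

Read this way, 16 cores deliver between roughly a quarter and about 94% of ideal linear scaling across the applications studied.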
However, this paper ultimately argues that while MPI offers reasonable performance on multicores when, for instance, legacy applications must be run, its large overheads squander the multicore opportunity. Multicore performance could be significantly improved by replacing MPI with a lighter-weight communication API that has a smaller memory footprint.