Significantly reducing MPI intercommunication latency and power overhead in both embedded and HPC systems

Authors:
Pavlos M. Mattheakis;Ioannis Papaefstathiou
Affiliations:
Telecommunication Systems Institute, Technical University of Crete, and University of Crete, Heraklion, Greece;Synelixis Solutions Ltd, Chalkida, Greece
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Year:
2013



Abstract

Highly parallel systems are becoming mainstream in a wide range of sectors, from their traditional stronghold in high-performance computing to data centers and even embedded systems. However, despite the large improvements in the cost and performance of individual components over the last decade (e.g., processor speeds and memory and interconnection bandwidth), system manufacturers still struggle to deliver low-latency, highly scalable solutions. One of the main reasons is that intercommunication latency grows significantly with the number of processor nodes. This article presents a novel way to reduce this intercommunication delay by implementing certain communication tasks in custom hardware. In particular, the proposed device implements the two most widely used procedures of the most popular communication protocol in parallel systems, the Message Passing Interface (MPI). Our approach was first simulated within a pioneering parallel systems simulation framework and then synthesized directly from a high-level description language (i.e., SystemC) using a state-of-the-art synthesis tool. To the best of our knowledge, this is the first article presenting the complete hardware implementation of such a system. The proposed approach achieves a speedup of one to four orders of magnitude over conventional software-based solutions and of one to three orders of magnitude over a sophisticated software-based approach. Moreover, the performance of our system is one to two orders of magnitude higher than the simulated performance of a similar but relatively simpler hardware architecture; at the same time, the power consumption of our device is about two orders of magnitude lower than that of a low-power CPU executing the exact same intercommunication tasks.