Enhancing NIC Performance for MPI using Processing-in-Memory

Authors:
Arun Rodrigues;Richard Murphy;Ron Brightwell;Keith D. Underwood
Affiliations:
Sandia National Laboratories, Albuquerque, NM;Sandia National Laboratories, Albuquerque, NM;Sandia National Laboratories, Albuquerque, NM;Sandia National Laboratories, Albuquerque, NM
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
Year:
2005

Abstract

Processing-in-Memory (PIM) technology encompasses a range of research that leverages a tight coupling of memory and processing. Its most distinctive features are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multithreading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of the posted receive and unexpected message queues. Using PIMs in the network interface can enhance performance without adding cost or increasing clock frequency. The result is a significant decrease in latency and an increase in small-message bandwidth, particularly when long queues are present.
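To make the queue management mentioned in the abstract concrete, the following is a minimal sketch of standard MPI matching semantics. It is not the paper's implementation and says nothing about the PIM design itself. The names `Envelope`, `MatchQueues`, `on_message`, and `on_post_recv` are hypothetical, and the wildcard values are placeholders for `MPI_ANY_SOURCE` and `MPI_ANY_TAG`. The sketch only illustrates that each match is an in-order search, so the number of entries examined grows with queue length.

```python
from collections import namedtuple

# Placeholder wildcard values; real MPI implementations define their own.
ANY_SOURCE = -1
ANY_TAG = -1

# Hypothetical envelope: the fields MPI uses to match a message to a receive.
Envelope = namedtuple("Envelope", ["source", "tag", "comm"])


def matches(recv, msg):
    """True if a posted receive accepts this message (wildcards allowed)."""
    return (recv.comm == msg.comm
            and recv.source in (ANY_SOURCE, msg.source)
            and recv.tag in (ANY_TAG, msg.tag))


class MatchQueues:
    """Illustrative posted receive queue and unexpected message queue."""

    def __init__(self):
        self.posted = []      # receives posted before their message arrived
        self.unexpected = []  # messages that arrived before a receive was posted

    def on_message(self, msg):
        """Search the posted queue in order; return (match, entries examined)."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                return recv, i + 1
        examined = len(self.posted)
        self.unexpected.append(msg)  # no receive yet: hold the message
        return None, examined

    def on_post_recv(self, recv):
        """Search the unexpected queue in order; return (match, entries examined)."""
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return msg, i + 1
        examined = len(self.unexpected)
        self.posted.append(recv)  # no message yet: queue the receive
        return None, examined


if __name__ == "__main__":
    q = MatchQueues()
    for tag in range(100):
        q.on_post_recv(Envelope(source=0, tag=tag, comm=0))
    # A message matching the last posted receive walks the whole queue.
    recv, examined = q.on_message(Envelope(source=0, tag=99, comm=0))
    print(recv.tag, examined)
```

In this sketch a message that matches the hundredth posted receive examines all one hundred entries before it matches, which is the kind of long-queue traversal the abstract identifies as the case where acceleration helps most.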