Scalable NIC-based Reduction on Large-scale Clusters

Authors:
Adam Moody; Juan Fernandez; Fabrizio Petrini; Dhabaleswar K. Panda
Affiliations:
The Ohio State University, Columbus; Los Alamos National Laboratory, NM; Los Alamos National Laboratory, NM; The Ohio State University, Columbus
Venue:
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Year:
2003

Abstract

Many parallel algorithms require efficient reduction collectives. In response, researchers have designed algorithms that account for a range of parameters, including data size, system size, and communication characteristics. Throughout this past work, however, processing was limited to the host CPU. Modern Network Interface Cards (NICs) carry programmable processors with substantial memory, and thus introduce a new variable into the equation. In this paper, we investigate this new option in the context of large-scale clusters. Through experiments on the 960-node, 1920-processor ASCI Linux Cluster (ALC) at Lawrence Livermore National Laboratory, we show that NIC-based reductions outperform host-based algorithms, achieving lower latency and greater consistency. In particular, in the largest configuration tested (1812 processors), our NIC-based algorithm summed single-element vectors of 32-bit integers and 64-bit floating-point numbers in 73 µs and 118 µs, respectively. These results represent respective improvements of 121% and 39% over the production-level MPI library.
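
To make the reduction collective concrete, the following is a minimal sketch of a sum reduction over a binomial tree, simulated in a single Python process. It is an illustration only: the abstract does not describe the tree shape or the protocol the NIC-based algorithm uses, so the binomial tree, the function name `binomial_tree_reduce`, and the per-rank input values are all assumptions made for this example.

```python
def binomial_tree_reduce(values):
    """Sum one value per simulated rank using a binomial tree rooted at rank 0.

    Hypothetical helper for illustration; not the paper's NIC-based algorithm.
    Returns (total, rounds), where rounds counts the communication steps.
    """
    partial = list(values)  # partial[i] is the running sum held by rank i
    n = len(partial)
    rounds = 0
    stride = 1
    while stride < n:
        # In each round, rank (receiver + stride) sends its partial sum
        # to rank receiver, which adds it to its own partial sum.
        for receiver in range(0, n, 2 * stride):
            sender = receiver + stride
            if sender < n:
                partial[receiver] += partial[sender]
        stride *= 2
        rounds += 1
    return partial[0], rounds


# Assumed input: rank i contributes the single integer i.
total, rounds = binomial_tree_reduce(range(1812))
```

With 1812 simulated ranks, the loop runs 11 rounds, which matches the ceiling of log2(1812). The number of rounds grows logarithmically with the processor count, which suggests why per-step overhead matters at this scale.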