NIC-based reduction algorithms for large-scale clusters

Authors:
Fabrizio Petrini; Adam Moody; Juan Fernandez; Eitan Frachtenberg; Dhabaleswar K. Panda
Affiliations:
Applied Computer Science Group, Pacific Northwest National Laboratory, Richland, WA 99352, USA; Integrated Computing and Communications Department, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA; Computer Engineering Department, University of Murcia, 30071 Murcia, Spain; Computer and Computational Sciences (CCS) Division, Los Alamos National Laboratory, NM 87545, USA; Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210, USA
Venue:
International Journal of High Performance Computing and Networking
Year:
2006

Abstract

Efficient reduction algorithms are crucial to many large-scale parallel scientific applications. Whereas previous algorithms confine processing to the host CPU, we explore and utilise the processors in modern cluster Network Interface Cards (NICs). We present the design issues, solutions, analytical models, and experimental evaluations of a family of NIC-based reduction algorithms. In experiments on the ALC cluster at Lawrence Livermore National Laboratory, which connects 960 dual-CPU nodes with the Quadrics QsNet interconnect, we find NIC-based reductions to be more efficient than host-based implementations. At large scale, our NIC-based reductions are more than twice as fast as the host-based, production-level MPI implementation.