Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Efficient reduction algorithms are crucial to many large-scale, parallel scientific applications. While previous algorithms constrain processing to the host CPU, we explore and utilise the processors in modern cluster Network Interface Cards (NICs). We present the design issues, solutions, analytical models, and experimental evaluations of a family of NIC-based reduction algorithms. Through experiments on the ALC cluster at Lawrence Livermore National Laboratory, which connects 960 dual-CPU nodes with the Quadrics QsNet interconnect, we find NIC-based reductions to be more efficient than host-based implementations. At large scale, our NIC-based reductions are more than twice as fast as the host-based, production-level MPI implementation.