Implications of application usage characteristics for collective communication offload

Authors:
Ron Brightwell;Sue P. Goudy;Arun Rodrigues;Keith D. Underwood
Affiliations:
Sandia National Laboratories, P.O. Box 5800, MS-1110 Albuquerque, NM 87185-1110, USA.;Sandia National Laboratories, P.O. Box 5800, MS-1110 Albuquerque, NM 87185-1110, USA.;Sandia National Laboratories, P.O. Box 5800, MS-1110 Albuquerque, NM 87185-1110, USA.;Sandia National Laboratories, P.O. Box 5800, MS-1110 Albuquerque, NM 87185-1110, USA
Venue:
International Journal of High Performance Computing and Networking
Year:
2006

Abstract

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach to improving collective performance uses a programmable network interface to implement collectives directly. While these implementations improve micro-benchmark performance, accelerating applications will require a deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.