The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance by using a programmable network interface to implement collectives directly. While these implementations improve micro-benchmark performance, accelerating applications will require a deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.
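
To make the idea of "potential benefit" concrete, the following is a minimal sketch of an idealized overlap model. It is not taken from the paper: the function names, the `overhead` parameter, and the assumption that computation and an offloaded collective can overlap perfectly are all illustrative. Under that assumption, a blocking collective costs the sum of the compute and collective times, while a non-blocking one costs the larger of the two plus any fixed initiation and completion overhead.

```python
def blocking_time(compute: float, collective: float) -> float:
    """Time per step when the collective blocks: the phases run back to back."""
    return compute + collective


def nonblocking_time(compute: float, collective: float, overhead: float = 0.0) -> float:
    """Time per step under the (assumed) ideal case of perfect overlap.

    `overhead` is a hypothetical fixed cost for starting and completing the
    non-blocking operation; it is not a quantity defined in the source.
    """
    return max(compute, collective) + overhead


def overlap_benefit(compute: float, collective: float, overhead: float = 0.0) -> float:
    """Time saved per step by overlapping; negative if the overhead dominates."""
    return blocking_time(compute, collective) - nonblocking_time(compute, collective, overhead)


if __name__ == "__main__":
    # Hypothetical timings in arbitrary units, chosen only for illustration.
    print(overlap_benefit(compute=10.0, collective=4.0, overhead=1.0))
```

In this model the saving can never exceed the shorter of the two phases, so an application with little independent work between starting a collective and needing its result gains little from a non-blocking interface, however fast the offload engine is.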