In this paper, we report on a project to develop a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with wormhole routing, but the techniques are more general. The approach differs from traditional library implementations in that we address the need for implementations that perform well across a range of vector lengths and grid dimensions, including non-power-of-two grids. We show how a general approach to hybrid algorithms yields good performance across the entire range of vector lengths. Moreover, many scalable implementations of application libraries require collective communication within groups of nodes. Our approach yields the same kind of performance for group collective communication. Results from the Intel Paragon system are included. To obtain this library for Intel systems, contact intercom@cs.utexas.edu.
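The idea of choosing an algorithm by vector length can be illustrated with a minimal sketch. It is not the library's implementation. It assumes the textbook latency/bandwidth cost model (a message of n bytes costs alpha + n*beta) and compares two common broadcast strategies: a minimum-spanning-tree broadcast, which tends to win for short vectors, and a scatter followed by a ring collect, which tends to win for long vectors. The function names, the cost formulas, and the parameter values below are illustrative assumptions and do not come from the abstract.

```python
def ceil_log2(p):
    """Return ceil(log2(p)) for an integer p >= 1, using integer arithmetic."""
    return (p - 1).bit_length()


def mst_broadcast_cost(p, n, alpha, beta):
    """Assumed cost of a minimum-spanning-tree broadcast.

    Each of the ceil(log2 p) steps sends the whole n-byte vector.
    """
    return ceil_log2(p) * (alpha + n * beta)


def scatter_collect_broadcast_cost(p, n, alpha, beta):
    """Assumed cost of a scatter followed by a ring collect (allgather).

    The scatter pays ceil(log2 p) startups, the ring collect pays p - 1,
    and each phase moves (p - 1)/p of the vector.
    """
    if p <= 1:
        return 0.0
    startups = ceil_log2(p) + (p - 1)
    return startups * alpha + 2.0 * (p - 1) / p * n * beta


def choose_broadcast(p, n, alpha, beta):
    """Pick the strategy with the lower modeled cost (hypothetical helper)."""
    short = mst_broadcast_cost(p, n, alpha, beta)
    long_ = scatter_collect_broadcast_cost(p, n, alpha, beta)
    return "mst" if short <= long_ else "scatter_collect"


if __name__ == "__main__":
    # Illustrative parameters only: alpha in microseconds, beta in
    # microseconds per byte. Real values must be measured on the machine.
    alpha, beta = 100.0, 0.01
    for p in (15, 16):  # includes a non-power-of-two node count
        for n in (8, 1_000_000):
            print(p, n, choose_broadcast(p, n, alpha, beta))
```

A hybrid scheme in this spirit would evaluate such a model (or measured timings) per call and switch strategies at the crossover length, so that neither short nor long vectors are penalized.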