Distributed memory matrix-vector multiplication and conjugate gradient algorithms
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
A dominating set model for broadcast in all-port wormhole-routed 2D mesh networks
ICS '94 Proceedings of the 8th international conference on Supercomputing
Static and Run-Time Algorithms for All-to-Many Personalized Communication on Permutation Networks
IEEE Transactions on Parallel and Distributed Systems
Circuit-Switched Broadcasting in Torus Networks
IEEE Transactions on Parallel and Distributed Systems
Proceedings of the 24th annual international symposium on Computer architecture
Efficient Broadcast and Multicast on Multistage Interconnection Networks Using Multiport Encoding
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Building a high-performance collective communication library
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Efficient communication algorithms for pipeline multicomputers
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Faster topology-aware collective algorithms through non-minimal communication
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Hi-index | 0.00 |
In this paper, we disprove the common assumption that the time for broadcasting in a mesh is at best proportional to the square root of the number of processors, at least in the presence of worm-hole routing. We present an optimal algorithm for broadcasting in mesh-connected distributed-memory architectures with worm-hole routing. By organizing the processing nodes in a logical spanning tree, the algorithm executes in time proportional to the logarithm of the number of nodes without inducing contention in the communication network. We restrict the number of nodes in each dimension of the processor mesh to be a power of two. Our method provides insight into how to avoid and/or reduce network contention on meshes for other communication operations. Experimental results on the Intel Touchstone Delta system are included.